LNCS 3156




Marc Joye 

Jean-Jacques Quisquater (Eds.) 



Cryptographic Hardware 
and Embedded Systems - 



CHES 2004 

6th International Workshop 
Cambridge, MA, USA, August 2004 
Proceedings 




Springer



Lecture Notes in Computer Science 

Commenced Publication in 1973
Founding and Former Series Editors: 

Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen 



Editorial Board 

David Hutchison 

Lancaster University, UK 
Takeo Kanade 

Carnegie Mellon University, Pittsburgh, PA, USA 
Josef Kittler 

University of Surrey, Guildford, UK 
Jon M. Kleinberg 

Cornell University, Ithaca, NY, USA 
Friedemann Mattern 

ETH Zurich, Switzerland 
John C. Mitchell 

Stanford University, CA, USA 
Moni Naor 

Weizmann Institute of Science, Rehovot, Israel 
Oscar Nierstrasz 

University of Bern, Switzerland 
C. Pandu Rangan 

Indian Institute of Technology, Madras, India 
Bernhard Steffen 

University of Dortmund, Germany 
Madhu Sudan 

Massachusetts Institute of Technology, MA, USA 
Demetri Terzopoulos 

New York University, NY, USA 
Doug Tygar 

University of California, Berkeley, CA, USA 
Moshe Y. Vardi 

Rice University, Houston, TX, USA
Gerhard Weikum 

Max-Planck Institute of Computer Science, Saarbruecken, Germany



3156 




Marc Joye Jean-Jacques Quisquater (Eds.) 



Cryptographic Hardware 
and Embedded Systems - 



CHES 2004 



6th International Workshop 

Cambridge, MA, USA, August 11-13, 2004 

Proceedings 



Springer 




Volume Editors 



Marc Joye 

Gemplus, Card Security Group 
La Vigie, Avenue du Jujubier, ZI Athelia IV 
13705 La Ciotat Cedex, France 
E-mail: marc.joye@gemplus.com 

Jean-Jacques Quisquater 
Universite Catholique de Louvain 
UCL Crypto Group 
Place du Levant 3 
1348 Louvain-la-Neuve, Belgium 
E-mail: jjq@dice.ucl.ac.be 



Library of Congress Control Number: 2004109601 



CR Subject Classification (1998): E.3, C.2, C.3, B.7, G.2.1, D.4.6, K.6.5, F.2.1, J.2 
ISSN 0302-9743 

ISBN 3-540-22666-4 Springer Berlin Heidelberg New York 



This work is subject to copyright. All rights are reserved, whether the whole or part of the material is 
concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, 
reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication 
or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, 
in its current version, and permission for use must always be obtained from Springer. Violations are liable 
to prosecution under the German Copyright Law. 

Springer is a part of Springer Science+Business Media 

springeronline.com 

© International Association for Cryptologic Research 2004 
Printed in Germany 

Typesetting: Camera-ready by author, data conversion by PTP-Berlin, Protago-TeX-Production GmbH 
Printed on acid-free paper SPIN: 11307204 06/3142 5 4 3 2 1 0




Preface 



These are the proceedings of CHES 2004, the 6th Workshop on Cryptographic
Hardware and Embedded Systems. For the first time, the CHES Workshop was 
sponsored by the International Association for Cryptologic Research (IACR). 

This year, the number of submissions reached a new record. One hundred 
and twenty-five papers were submitted, of which 32 were selected for presenta- 
tion. Each submitted paper was reviewed by at least 3 members of the program 
committee. We are very grateful to the program committee for their hard and 
efficient work in assembling the program. We are also grateful to the 108 external 
referees who helped in the review process in their area of expertise. 

In addition to the submitted contributions, the program included three invited
talks, by Neil Gershenfeld (Center for Bits and Atoms, MIT) about “Physical
Information Security”, by Isaac Chuang (Medialab, MIT) about “Quantum
Cryptography”, and by Paul Kocher (Cryptography Research) about “Physical
Attacks”. It also included a rump session, chaired by Christof Paar, which
featured informal talks on recent results.

As in the previous years, the workshop focused on all aspects of cryptographic 
hardware and embedded system security. We sincerely hope that the CHES 
Workshop series will remain a premium forum for intellectual exchange in this 
area. 

This workshop would not have been possible without the involvement of 
several persons. In addition to the program committee members and the external 
referees, we would like to thank Christof Paar and Berk Sunar for their help on 
local organization. Special thanks also go to Karsten Tellmann for maintaining 
the Web pages and to Julien Brouchier for installing and running the submission 
and reviewing software of K.U. Leuven. Last but not least, we would like to
thank all the authors who submitted papers, making the workshop possible, and 
the authors of accepted papers for their cooperation.

Towards Efficient Second-Order Power Analysis



Jason Waddle and David Wagner 
University of California at Berkeley 



Abstract. Viable cryptosystem designs must address power analysis 
attacks, and masking is a commonly proposed technique for defend- 
ing against these side-channel attacks. It is possible to overcome sim- 
ple masking by using higher-order techniques, but apparently only at 
some cost in terms of generality, number of required samples from the 
device being attacked, and computational complexity. We make progress 
towards ascertaining the significance of these costs by exploring a cou- 
ple of attacks that attempt to efficiently employ second-order techniques 
to overcome masking. In particular, we consider two variants of second- 
order differential power analysis: Zero-Offset 2DPA and FFT 2DPA. 



1 Introduction 

Power analysis is a major concern for designers of smartcards and other embedded
cryptosystems. The introduction of Differential Power Analysis (DPA) in 1998
by Paul Kocher [1] made power analysis attacks even more practical, since an
attacker using DPA did not need to know very much about the device being
attacked.

The technique of masking or duplication is commonly suggested as a way 
to stymie first-order power attacks, including DPA. In order to defeat masking, 
attacks would have to correlate the power consumption at multiple times during 
a single computation. Attacks of this sort were suggested and investigated (for 
example, by Thomas Messerges [2]), but it seems that the attacker was once 
again required to know significant details about the device under analysis. 

This paper attempts to make progress towards a second-order analog of Differential
Power Analysis. To this end, we suggest two second-order attacks, neither
of which requires much more time than straight DPA, but which are able to
defeat some countermeasures. These attacks are basically preprocessing routines
that attempt to correlate power traces with themselves and then apply standard
DPA to the results.

In Section 2, we give some background and contrast first-order and second- 
order power analysis techniques. We also discuss the apparently inherent costs 
of higher-order attacks. 

In Section 3, we present our model and give the intuition behind our tech- 
niques. 

In Section 4, we give some techniques for second-order power analysis. In
particular, we present some algorithms and analyze them in terms of limitations
and requirements: generality, runtime, and number of required traces.

Section 5 contains some closing remarks, and Appendix A gives the formal
derivations for the noise amplifications that are behind the limitations of the
attacks in Section 4.



2 First-Order and Second-Order Power Analysis 

We consider a cryptosystem that takes an input, performs some computations
that combine this input and some internally stored secret, and produces an
output. For concreteness, we will refer to this computation as an encryption, an
input as a plaintext, the secret as a key, and the output as a ciphertext, though
it is not necessary that the device actually be encrypting. An attacker would
like to extract the secret from this device. If the attacker uses only the input
and output information (i.e., the attacker treats the cryptosystem as a “black
box”), the attacker is operating in a traditional private-computation model; in this
case, the secret’s safety is entirely up to the algorithm implemented by the device.

In practice, however, the attacker may have access to some more side-channel 
information about the device’s computation; if this extra information is corre- 
lated with the secret, it may be exploitable. This information can come from 
a variety of observables: timing, electromagnetic radiation, power consumption, 
etc. Since power consumption can usually be measured by externally probing the 
connection of the device with its power supply, it is one of the easiest of these 
side-channels to exploit, and it is our focus in this discussion. 



2.1 First-Order Power Analysis Attacks 

First-order attacks are characterized by the property that they exploit highly lo- 
cal correlation of the secret with the power trace. Typically, the secret-correlated 
power draw occurs at a consistent time during the encryption and has consistent 
sign and magnitude. 



Simple Power Analysis (SPA). In simple first-order power analysis attacks, 
the adversary is assumed to have some fairly explicit knowledge of the analyzed 
cryptosystem. In particular, he knows the time at which the power consumption 
is correlated with part of the secret. By measuring the power consumption at 
this time (and perhaps averaging over a few encryptions to reduce the ambiguity 
introduced by noise), he gains some information about the key. 
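The averaging step described above can be sketched as follows. This is only an illustrative simulation, not a measurement procedure from the paper: the baseline mean `m`, the key-dependent offset `delta`, the noise level `sigma`, and the function names are all assumptions chosen for the example, and the noise is assumed Gaussian.

```python
import random

def measure(key_bit, rng, m=10.0, delta=0.5, sigma=1.0):
    """One noisy power sample at the known leakage time (hypothetical model)."""
    mean = m + delta if key_bit else m - delta
    return rng.gauss(mean, sigma)

def guess_bit(samples, m=10.0):
    """Compare the sample average with the midpoint m of the two possible means."""
    return 1 if sum(samples) / len(samples) > m else 0

rng = random.Random(2004)
samples_for_one = [measure(1, rng) for _ in range(2000)]
samples_for_zero = [measure(0, rng) for _ in range(2000)]
print(guess_bit(samples_for_one), guess_bit(samples_for_zero))
```

With 2000 samples the standard error of the average is far below `delta`, so the midpoint test separates the two cases reliably; with a smaller `delta` relative to `sigma`, more samples would be needed.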

As a simple example, suppose the attacker knows that the first bit of the
key, $k_0$, is loaded into a register at 100 µs into the encryption. The average power
draw at 100 µs is $m$, but when $k_0$ is 0 this average is $m - \delta$ and when $k_0$ is
1, this average is $m + \delta$. Given enough samples of the power draw at 100 µs to
distinguish these means (where the number of samples required depends on the
level of noise relative to $\delta$), he can determine the first bit of the key.

Differential Power Analysis (DPA). One of the most amazing and troublesome
features of differential power analysis is that, unlike with SPA, the attacker
does not need such specific information about how the analyzed device implements
its function. In particular, she can be ignorant of the specific times at
which the power consumption is correlated with the secret; it is only necessary
that the correlation is reasonably consistent.

In differential power analysis attacks, the attacker has identified some in- 
termediate value in the computation that is 1) correlated with the power con- 
sumption, and 2) dependent only on the plaintext (or ciphertext or both) and 
some small part of the key. She gathers a collection of power traces by sampling 
power consumption at a very high frequency throughout a series of encryptions 
of different plaintexts. If the intermediate value is sufficiently correlated with the 
power consumption, the adversary can use the power traces to verify guesses at 
the small part of the key. 

In particular, for each possible value of the relevant part of the key, the attacker
will divide the traces into groups according to the intermediate value predicted
by the current guess at the key and the trace’s corresponding plaintext (or
ciphertext); if the averaged power trace of each group differs noticeably from the
others (the difference of the averages will be large at the time of correlation),
it is likely that the current key guess is correct. Since incorrectly predicted
intermediate values will not be correlated with the measured power traces, incorrect
key guesses should result in all groups having very similar averaged power traces.
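The grouping procedure above can be sketched in a few lines. The intermediate bit used here, the parity of `plaintext AND key_guess`, is a hypothetical toy choice made only so that the example is small and noise-free; it is not taken from any real cipher, and the function names are likewise illustrative.

```python
def predict_bit(plaintext, key_guess):
    # Hypothetical toy intermediate: parity of (plaintext AND key_guess).
    return bin(plaintext & key_guess).count("1") & 1

def mean_trace(group):
    """Sample-wise average of a non-empty list of equal-length traces."""
    return [sum(column) / len(group) for column in zip(*group)]

def dpa_peak(traces, plaintexts, key_guess):
    """Largest absolute difference between the two groups' averaged traces."""
    groups = ([], [])
    for trace, plaintext in zip(traces, plaintexts):
        groups[predict_bit(plaintext, key_guess)].append(trace)
    if not groups[0] or not groups[1]:
        return 0.0  # degenerate grouping: nothing to compare
    mean0, mean1 = mean_trace(groups[0]), mean_trace(groups[1])
    return max(abs(a - b) for a, b in zip(mean1, mean0))

def best_key_guess(traces, plaintexts, candidates):
    return max(candidates, key=lambda k: dpa_peak(traces, plaintexts, k))

# Noise-free toy traces: the true intermediate bit leaks at sample index 1.
plaintexts = list(range(8))
true_key = 5
traces = [[0.0, float(predict_bit(p, true_key)), 0.0] for p in plaintexts]
print(best_key_guess(traces, plaintexts, range(1, 8)))
```

In this toy setting only the correct guess produces a non-zero peak; with real, noisy traces the peak for the correct guess would merely stand out from the others.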



2.2 Higher-Order Attacks 

A higher-order attack addresses a situation where there is some intermediate
value (or set of values) that depends only on the plaintext and some small part
of the key, but it is not correlated directly with the power consumption at any
particular time. Instead, this value contributes to the joint distribution of the
power consumption at a few times during the computation.

An important example of such a situation comes about when the masking 
(or duplication) technique is employed to protect against first-order attacks. 
As a typical example of masking, consider an implementation that wishes to 
perform a computation using some intermediate, key-dependent bit b. Rather 
than computing directly with b and opening itself up to DPA attacks, however, 
it performs the computation twice: once with a random bit r, then with the 
masked bit (r + b) (see footnote 1). The implementation is designed to use these two masked
intermediate results as inputs to the rest of the computation. 

In this case, knowledge of either $r$ or $r + b$ alone is not of any use to the
attacker. Since the first-order attacks look for local, linear correlation of $b$ with the
power draw, they are stymied. If, however, an attack could correlate the power
consumption at the time $r$ is present and the time $r + b$ is present (e.g., by
multiplying the power consumptions at these times), it could gain some information
on $b$.

Footnote 1. Though we use the symbol ‘+’ to denote the masking operation, we require nothing
from it other than that $c = a + b$ implies $(-1)^c = (-1)^{a+b}$; for our purposes, it is
convenient to just assume that ‘+’ is exclusive-or.

For example, suppose a cryptographic device employs masking to hide some
intermediate bit $b$ that is derived directly from the key, but displays the following
behavior: at 100 µs, the average power draw is $m + \delta(-1)^r$ and at 210 µs it is
$m + \delta(-1)^{r+b}$. An attacker aware of this fact could multiply the samples at
these times for each trace and obtain a product value with expected value (see footnote 2)

\[
E[\text{product of samples}] =
\begin{cases}
m^2 + \delta^2 & \text{if } b = 0 \\
m^2 - \delta^2 & \text{if } b = 1
\end{cases}
\tag{1}
\]

Summing these products over $n$ encryptions, the means would be $n(m^2 + \delta^2)$ for
$b = 0$ and $n(m^2 - \delta^2)$ for $b = 1$. By choosing $n$ large enough to reduce the relative
effect of noise, the attacker could distinguish these distributions and deduce $b$.
An attack of this sort is the second-order analog of an SPA attack.
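As a rough illustration of Equation (1), the following simulation multiplies the two leaking samples of each trace and averages the products. The parameter values and function names are assumptions made for the example; the noise at the two times is taken to be independent Gaussian noise, as in footnote 2, and ‘+’ is taken to be exclusive-or.

```python
import random

def product_sample(b, rng, m=2.0, delta=1.0, sigma=0.5):
    """Product of the two leaking samples of one masked computation."""
    r = rng.getrandbits(1)  # fresh random masking bit for this trace
    first = m + delta * (-1) ** r + rng.gauss(0.0, sigma)
    second = m + delta * (-1) ** (r ^ b) + rng.gauss(0.0, sigma)
    return first * second

def mean_product(b, n=20000, seed=1):
    rng = random.Random(seed)
    return sum(product_sample(b, rng) for _ in range(n)) / n

def guess_masked_bit(mean, m=2.0):
    """Means above m^2 suggest b = 0, means below m^2 suggest b = 1."""
    return 0 if mean > m * m else 1

print(guess_masked_bit(mean_product(0)), guess_masked_bit(mean_product(1)))
```

With these illustrative values the two expected products are $m^2 + \delta^2 = 5$ and $m^2 - \delta^2 = 3$, so the midpoint $m^2 = 4$ serves as the decision threshold.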

But how practical is this really? A higher-order attack seems to face two 
major problems: 

- How much does the process of correlation amplify the noise, thereby increasing
  standard deviation and requiring more samples to reliably differentiate
  distributions?

- How does it identify the times when the power consumption is correlated
  with an intermediate value?

The first issue is apparent when calculating the standard deviation of the product
computed in the above attack. If the power consumptions at times 100 µs and
210 µs both have standard deviation $\sigma$, then the product has standard deviation

\[
\begin{cases}
\sqrt{\sigma^4 + 2\sigma^2 m^2 + 4\delta^2 m^2 + 2\delta^2\sigma^2} & \text{if } b = 0, \\
\sqrt{\sigma^4 + 2\sigma^2 m^2 + 2\delta^2\sigma^2} & \text{if } b = 1,
\end{cases}
\tag{2}
\]

effectively squaring the standard deviation of zero-mean noise. This means that
substantially more samples are required to distinguish the $b = 0$ and $b = 1$
distributions than would be required in a first-order attack, if one were possible.
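The two standard deviations above can be checked without simulation. The sketch below is an independent check rather than the paper's derivation: it conditions on the masking bit, uses the second moment $\mu^2 + \sigma^2$ of a Gaussian with mean $\mu$, and compares the result with the closed forms. The function names and test values are illustrative.

```python
import math

def product_std_exact(b, m, delta, sigma):
    """Std. dev. of the product, computed by conditioning on the masking bit r."""
    first_moment = 0.0
    second_moment = 0.0
    for r in (0, 1):  # r is uniform, so each case has weight 1/2
        mu_x = m + delta * (-1) ** r
        mu_y = m + delta * (-1) ** (r ^ b)
        # Given r, the two samples are independent Gaussians.
        first_moment += 0.5 * mu_x * mu_y
        second_moment += 0.5 * (mu_x**2 + sigma**2) * (mu_y**2 + sigma**2)
    return math.sqrt(second_moment - first_moment**2)

def product_std_formula(b, m, delta, sigma):
    """Closed-form standard deviation of the product for b = 0 or b = 1."""
    variance = sigma**4 + 2 * sigma**2 * m**2 + 2 * delta**2 * sigma**2
    if b == 0:
        variance += 4 * delta**2 * m**2
    return math.sqrt(variance)

for b in (0, 1):
    print(b, product_std_exact(b, 2.0, 1.0, 0.5), product_std_formula(b, 2.0, 1.0, 0.5))
```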

The second issue is essentially the higher-order analog of the problem with
SPA: attackers require exact knowledge of the time at which the intermediate
value and the power consumption are correlated. DPA resolves this problem by
considering many samples of the power consumption throughout an encryption.
Unfortunately, the natural generalization of this approach to even second-order
attacks, where a product would be accumulated for each $(t_1, t_2)$ time pair, is
extremely computationally taxing. The second-order attacks discussed in this
paper avoid this overhead.

Footnote 2. Here the expectation is taken over both the random noise and the value of the
masking bit $r$, and the noise components at times 100 µs and 210 µs are assumed
independent.

3 The Power Analysis Model

Both of the attacks we present are second-order attacks which are essentially 
preprocessing steps applied to the power traces followed by standard DPA. 

In this section, we develop our model and present standard DPA in this 
framework, both as a point of reference and as a necessary subroutine for our 
attacks, which are described in Section 4. 



3.1 The Model 



We assume that the attacker has guessed part of the key and has predicted an
intermediate bit value $b$ for each of the power traces, grouping them into a $b = 0$
and a $b = 1$ group. For simplicity, we assume there are $n$ traces in each of these
groups: trace $i$ from group $b$ is called $T_i^b$, where $0 \le i < n$. Each trace contains
samples at $m$ evenly spaced times; the sample at time $t$ from this trace is denoted
$T_i^b(t)$, where $0 \le t < m$.

Each sample has a noise component and possibly a signal component, if it is
correlated with $b$. We assume that each noise component is Gaussian with equal
standard deviation and independent of the noise in other samples in its own
trace and other traces. For simplicity, we also assume that the input has been
normalized so that each noise component is a 0-mean Gaussian with standard
deviation one (i.e., $\sim \mathcal{N}(0, 1)$). The random variable for the noise component in
trace $i$ from group $b$ at time $t$ is $S_i^b(t)$, for $0 \le t < m$.

We assume that the device being analyzed is utilizing masking so that there
is a uniformly distributed independent random variable for each trace that
corresponds to the masking bit; it will be more convenient for us to deal with $\{\pm 1\}$
bit values, so if the random bit in trace $i$ from group $b$ is $r$, we define the random
variable $R_i^b = (-1)^r$.

Finally, if the guess for $b$ is correct, the power consumption is correlated
with the random masking bit and the intermediate value $b$ at the same times in
each trace. Specifically, we assume that there is some parameter $d$ (in units of
the standard deviation of the noise) and times $c_0$ and $c_1$ such that the random
bit makes a contribution of $dR_i^b$ to the power consumption at time $c_0$ and the
masked bit makes a contribution of $d(-1)^b R_i^b$ at time $c_1$.

We can now characterize the trace sample distributions in terms of these 
noise and signal components: 

- If the guess of the key is correct, then for $0 \le i < n$, $0 \le t < m$, and
  $b \in \{0, 1\}$, we have:

\[
T_i^b(t) =
\begin{cases}
S_i^b(t) + dR_i^b & \text{if } t = c_0 \\
S_i^b(t) + d(-1)^b R_i^b & \text{if } t = c_1 \\
S_i^b(t) & \text{otherwise}
\end{cases}
\tag{3}
\]

- If the key is predicted incorrectly, however, then the groups are not correlated
  with the true value of $b$ in each trace and hence there is no correlation
  between the grouping and the power consumption in the traces, so, for
  $0 \le i < n$, $0 \le t < m$, and $b \in \{0, 1\}$:

\[
T_i^b(t) = S_i^b(t) \tag{4}
\]

Given these traces as inputs, the algorithms try to decide whether the groupings 
(and hence the guess for the key) are correct by distinguishing these distribu- 
tions. 
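For experimentation, traces following this model can be generated synthetically. The sketch below is one possible reading of Equations (3) and (4); the function name `make_group` and its parameters (including the `noise_sd` switch, which is not part of the model and is only there to allow noise-free inspection) are assumptions introduced for illustration.

```python
import random

def make_group(b, n, m, d, c0, c1, correct=True, noise_sd=1.0, rng=None):
    """Generate n traces of m samples for group b under the masking model.

    correct=True follows Equation (3): the mask R contributes d*R at time c0
    and d*(-1)^b*R at time c1. correct=False follows Equation (4): pure noise.
    """
    rng = rng or random.Random()
    group = []
    for _ in range(n):
        trace = [rng.gauss(0.0, noise_sd) for _ in range(m)]
        if correct:
            mask = (-1) ** rng.getrandbits(1)  # R in {+1, -1}
            trace[c0] += d * mask
            trace[c1] += d * (-1) ** b * mask  # adds onto c0 when c0 == c1
        group.append(trace)
    return group

# Noise-free look at the signal components of a b = 1 group.
for trace in make_group(b=1, n=3, m=6, d=2.0, c0=1, c1=4, noise_sd=0.0,
                        rng=random.Random(0)):
    print(trace)
```

Setting `c0 == c1` yields the coincident case exploited by the Zero-Offset attack in Section 4.1.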



3.2 The Generic DPA Subroutine 

Both algorithms use a subroutine DPA after their preprocessing step. For our
purposes, this subroutine simply takes the two groups of traces, $T^0$ and $T^1$, and a
threshold value $\tau$, and determines whether the groups’ totalled traces differ by
more than $\tau$ at any sample time. If the difference of the totalled traces is greater
than $\tau$ at any point, DPA returns 1, indicating that $T^0$ and $T^1$ have different
distributions; if the difference is no more than $\tau$ at any point, DPA returns 0,
indicating that it thinks $T^0$ and $T^1$ are identically distributed.

DPA($T^0$, $T^1$, $\tau$)

    for each $t \in \{0, \ldots, m-1\}$:
        $s \leftarrow 0$
        for each $i \in \{0, \ldots, n-1\}$:
            $s \leftarrow s + T_i^0(t) - T_i^1(t)$
        if $|s| > \tau$ return 1
    return 0
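The DPA subroutine translates almost line by line into Python. In this sketch each group is assumed to be a list of equal-length lists of floats, and the two groups are assumed to contain the same number of traces, as in the model; the name `dpa` and the data layout are choices made for the example.

```python
def dpa(group0, group1, tau):
    """Return 1 if the totalled traces differ by more than tau at any time, else 0."""
    m = len(group0[0])
    for t in range(m):
        s = 0.0
        for trace0, trace1 in zip(group0, group1):
            s += trace0[t] - trace1[t]
        if abs(s) > tau:
            return 1
    return 0

group0 = [[0.0, 3.0, 0.0], [0.0, 2.0, 0.0]]
group1 = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
print(dpa(group0, group1, 4.0), dpa(group0, group1, 5.0))
```

Note that the comparison is strict, so a totalled difference exactly equal to the threshold is reported as "identically distributed".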

When using the DPA subroutine, it is most important to pick the threshold,
$\tau$, appropriately. Typically, to minimize the impact of false positives and false
negatives, $\tau$ should be half the expected difference at the time of correlation.
This is perhaps unexpected, since false positives are actually far more likely than
false negatives when using a midpoint threshold test: false positives can occur if
any of the $m$ times’ sample sums deviates above $\tau$, while false negatives require
exactly the correlated time’s sample sum to deviate below $\tau$. The reason for not
choosing $\tau$ to equalize the probabilities is that false negatives are far more
detrimental than false positives: an attack suggesting two likely subkeys is more
helpful than an attack suggesting none.

An equally important consideration in using DPA is whether $\tau$ is large enough
compared to the noise to reduce the probability of error. Typically, the samples’
noise components will be independent and the summed samples’ noise will be
Gaussian, so we can achieve negligible probability of error by using $n$ large
enough that $\tau$ is some constant multiple of the standard deviation.

DPA runs in time $O(nm)$. Each run of DPA decides the correctness of only
one guessed grouping, however, so an attack that tries $l$ groupings runs in time
$O(nml)$.

4 Our Second-Order Attacks

The two second-order variants of DPA that we discuss are Zero-Offset 2DPA
and FFT 2DPA. The former is applied in the special but not necessarily unlikely
situation when the power correlation times for the two bits are coincident
(i.e., the random bit $r$ and the masked bit $r + b$ are correlated with the power
consumption at the same time). The latter attack applies to the more general
situation where the attacker does not know the times of correlation; it discovers
the correlation with only slight computational overhead but pays a price in the
number of required samples.



4.1 Zero-Offset 2DPA 



Zero-Offset 2DPA is a very simple variation of ordinary first-order DPA that can
be applied against systems that employ masking in such a way that both the
random bit $r$ and the masked intermediate bit $r + b$ correlate with the power
consumption at the same time. In the language of our model, $c_0 = c_1$.

The coincident effect of the two masked values may seem to be too specialized
a circumstance to occur in practice, but it does come up. The motivation for
this attack is the claim by Coron and Goubin [3] that some techniques suggested
by Messerges [4] were insecure due to some register containing the multi-bit
intermediate value $a$ or its complement $\bar{a}$. Since Messerges assumes a power
consumption model based on Hamming weight, it was not clear how a first-order
attack would exploit this register. However, we observe that such a system can
be attacked (even in the Hamming model) by a Zero-Offset 2DPA that uses as its
intermediate value the exclusive-or of the first two bits of $a$. Another example of
a situation with coincident power consumption correlation is in a paired circuit
design that computes with both the random and masked inputs in parallel.

Combining $c_0 = c_1$ with Equation (3), we see that in a correct grouping:

\[
T_i^b(t) =
\begin{cases}
S_i^b(t) + dR_i^b + d(-1)^b R_i^b & \text{if } t = c_0 \\
S_i^b(t) & \text{otherwise}
\end{cases}
\tag{5}
\]

In an incorrect grouping, $T_i^b(t)$ is distributed exactly as in the general
uncorrelated case in Equation (4).

Note that in a correct grouping, when $b = 1$, the influences of the two bits
cancel, leaving $T_i^1(c_0) = S_i^1(c_0)$, while when $b = 0$, the influences of the two
bits combine constructively and we get $T_i^0(c_0) = S_i^0(c_0) + 2dR_i^0$. In the former
case, there appears to be no influence of the bits on the power consumption
distribution, but in the latter case, the bits contribute a bimodal component.
The bimodal component has mean 0, however, so it would not be apparent in a
first-order averaging analysis.

Zero-Offset 2DPA exploits the bimodal component for the $b = 0$ case by
simply squaring the samples in the power traces before running straight DPA.

Zero-Offset-2DPA($T^0$, $T^1$)

    for each $b \in \{0, 1\}$, $i \in \{0, \ldots, n-1\}$, $t \in \{0, \ldots, m-1\}$:
        $T_i^b(t) \leftarrow (T_i^b(t))^2$
    return DPA($T^0$, $T^1$, $2nd^2$)
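A minimal Python sketch of this preprocessing step follows. It bundles its own copy of the DPA subroutine so that it runs on its own; the function names, the list-of-lists trace layout, and the noise-free toy traces are assumptions made for illustration, and the attacker is assumed to know (or guess) `d`.

```python
def dpa(group0, group1, tau):
    """Return 1 if the totalled traces differ by more than tau at any time."""
    for t in range(len(group0[0])):
        s = sum(tr[t] for tr in group0) - sum(tr[t] for tr in group1)
        if abs(s) > tau:
            return 1
    return 0

def zero_offset_2dpa(group0, group1, d):
    """Square every sample, then run DPA with threshold 2*n*d^2."""
    n = len(group0)
    squared0 = [[x * x for x in trace] for trace in group0]
    squared1 = [[x * x for x in trace] for trace in group1]
    return dpa(squared0, squared1, 2 * n * d * d)

# Noise-free toy traces with c0 = c1 = 1 and d = 1:
# for b = 0 the sample is 2*d*R, for b = 1 the two contributions cancel.
b0_group = [[0.0, 2.0, 0.0], [0.0, -2.0, 0.0], [0.0, 2.0, 0.0], [0.0, -2.0, 0.0]]
b1_group = [[0.0, 0.0, 0.0] for _ in range(4)]
print(dpa(b0_group, b1_group, 8.0), zero_offset_2dpa(b0_group, b1_group, 1.0))
```

In the toy data the bimodal component averages to zero, so plain DPA sees nothing, while the squared samples sum to $4nd^2 = 16$ and exceed the threshold $2nd^2 = 8$.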

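Why does this work? Suppose we have a correct grouping and consider the
expected values for the sum of the squares of the samples at time $c_0$ in the two
groups:

- if $b = 0$,

\[
\begin{aligned}
E\left[\sum_{i=0}^{n-1} \left(T_i^0(c_0)\right)^2\right]
&= \sum_{i=0}^{n-1} E\left[(S_i^0(c_0))^2 + 4d\,S_i^0(c_0)R_i^0 + 4d^2 (R_i^0)^2\right] \\
&= \sum_{i=0}^{n-1} \left(E\left[(S_i^0(c_0))^2\right] + E\left[4d\,S_i^0(c_0)R_i^0\right] + E\left[4d^2 (R_i^0)^2\right]\right) \\
&= \sum_{i=0}^{n-1} \left(1 + 0 + 4d^2\right) \\
&= 4nd^2 + n
\end{aligned}
\tag{6}
\]

- if $b = 1$,

\[
\begin{aligned}
E\left[\sum_{i=0}^{n-1} \left(T_i^1(c_0)\right)^2\right]
&= \sum_{i=0}^{n-1} E\left[(S_i^1(c_0))^2\right] \\
&= \sum_{i=0}^{n-1} 1 \\
&= n
\end{aligned}
\tag{7}
\]

The above derivations use the fact that if $S \sim \mathcal{N}(0, 1)$ then $S^2 \sim \chi^2(1, 0)$ (i.e., $S^2$
has $\chi^2$ distribution with $\nu = 1$ degree of freedom and non-centrality parameter
$\delta^2 = 0$), and the expected value of a $\chi^2(\nu, \delta^2)$ random variable is $\nu + \delta^2$.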

Thus, the expected difference of the sum of products for the $c_0$ samples
is $4nd^2$, while the expected difference for incorrect groupings is clearly 0. In
Section A.1, we show that the difference of the groups’ sums of products is
essentially Gaussian with standard deviation

\[
\sigma = \sqrt{n(16d^2 + 4)}. \tag{8}
\]



For an attack that uses a DPA threshold value at least $k$ standard deviations
from the mean, we need $2nd^2 \ge k\sigma$ and hence at least $k^2 \cdot \frac{4d^2 + 1}{d^4}$ traces.
This $1/d^4$ blowup factor may be substantial; recall that $d$ is in units of the
standard deviation of the noise, so it may be significantly less than 1.
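To get a feeling for this blowup, the bound can be evaluated numerically. The helper below is a hypothetical convenience function, not part of the paper: it assumes the threshold $2nd^2$ used by Zero-Offset-2DPA and the standard deviation from Equation (8), and solves $2nd^2 \ge k\sigma$ for $n$.

```python
import math

def zero_offset_traces_needed(d, k):
    """Smallest n (per group) with 2*n*d^2 >= k*sqrt(n*(16*d^2 + 4))."""
    return math.ceil(k**2 * (16 * d**2 + 4) / (4 * d**4))

for d in (1.0, 0.5, 0.25):
    print(d, zero_offset_traces_needed(d, k=3))
```

Halving `d` multiplies the dominant term by sixteen, which is the $1/d^4$ behaviour described above.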

The preprocessing in Zero-Offset-2DPA takes time $O(nm)$. After this preprocessing,
each of $l$ subsequent guessed groupings can be tested using DPA in
time $O(nm)$, for a total runtime of $O(nm + nml) = O(nml)$. It is important to
keep in mind when comparing these run times that the number $n$ of required
traces for Zero-Offset-2DPA can be somewhat larger than would be necessary
for first-order DPA, if a first-order attack were possible.



A Natural Variation: Known-Offset 2DPA. If the difference $s = c_1 - c_0$
is non-zero but known, a similar attack may be mounted. Instead of calculating
the squares of the samples, the adversary can calculate the lagged product:

\[
L_i^b(t, s) = T_i^b(t) \cdot T_i^b(t + s), \tag{9}
\]

where the addition $t + s$ is intended to be cyclic in $\{0, \ldots, m - 1\}$.

This lagged product at the correct offset $s = c_1 - c_0$ has properties similar
to the squared samples discussed above, and can be used in the same way.
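The lagged product of Equation (9) is a one-line transformation of a trace. In the sketch below, the function name and the list representation of a trace are illustrative choices; the cyclic index is taken modulo the trace length.

```python
def lagged_product(trace, s):
    """Return [T(t) * T(t + s)] for all t, with t + s taken cyclically."""
    m = len(trace)
    return [trace[t] * trace[(t + s) % m] for t in range(m)]

print(lagged_product([1.0, 2.0, 3.0, 4.0], 1))
print(lagged_product([1.0, 2.0, 3.0, 4.0], 0))
```

With `s = 0` the lagged product reduces to the squared samples used by Zero-Offset 2DPA.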

4.2 FFT 2DPA 

Fast Fourier Transform (FFT) 2DPA is useful in that it is more general than
Zero-Offset 2DPA: it does not require that the times of correlation be coincident,
and it does not require any particular information about $c_0$ and $c_1$.

To achieve this, it uses the FFT to compute the correlation of a trace with
itself, an autocorrelation. The autocorrelation $A_i^b$ of a trace $T_i^b$ is also defined
on values $t \in \{0, \ldots, m - 1\}$, but this argument is considered an offset or lag
value rather than an absolute time. Specifically, for $b \in \{0, 1\}$, $0 \le i < n$, and
$0 \le t < m$,

\[
A_i^b(t) = \sum_{j=0}^{m-1} T_i^b(j) \cdot T_i^b(j + t). \tag{10}
\]

The argument $j + t$ in $T_i^b(j + t)$ is understood to be cyclic in $\{0, \ldots, m - 1\}$, so
that $A_i^b(t) = A_i^b(m - t)$, and we really only need to consider $0 \le t \le m/2$.

To see why $A_i^b(t)$ might be useful, recall Equation (3) and notice that most
of the terms of $A_i^b(t)$ are of the form $S_i^b(j) \cdot S_i^b(j + t)$; in fact, the only terms
that differ are where $j$ or $j + t$ is $c_0$ or $c_1$. This observation suggests a way to
view the sum for $A_i^b(t)$ by splitting it up by the different types of terms from
Equation (3), and in fact it is instructive to do so. To simplify notation, let
$Q = \{c_0 - t, c_0, c_1 - t, c_1\}$, the set of “interesting” indices, where the terms of
$A_i^b(t)$ are “unusual” when $j \in Q$. Assuming $t \ne c_1 - c_0$,

\[
\begin{aligned}
A_i^b(t) ={} & S_i^b(c_0 - t) \cdot \left[S_i^b(c_0) + dR_i^b\right] \\
& + \left[S_i^b(c_0) + dR_i^b\right] \cdot S_i^b(c_0 + t) \\
& + S_i^b(c_1 - t) \cdot \left[S_i^b(c_1) + d(-1)^b R_i^b\right] \\
& + \left[S_i^b(c_1) + d(-1)^b R_i^b\right] \cdot S_i^b(c_1 + t) \\
& + \sum_{j \notin Q} S_i^b(j) \cdot S_i^b(j + t)
\end{aligned}
\tag{11}
\]

and we can distribute and recombine terms to get

\[
\begin{aligned}
A_i^b(t) ={} & \left[S_i^b(c_0 - t) + S_i^b(c_0 + t)\right] \cdot dR_i^b \\
& + \left[S_i^b(c_1 - t) + S_i^b(c_1 + t)\right] \cdot d(-1)^b R_i^b \\
& + \sum_{j=0}^{m-1} S_i^b(j) \cdot S_i^b(j + t).
\end{aligned}
\tag{12}
\]

Using Equation (12) and the fact that $E[XY] = E[X] \cdot E[Y]$ when $X$ and $Y$ are
independent random variables, it is straightforward to verify that $E[A_i^b(t)] = 0$
when $t \ne c_1 - c_0$; its terms in that case are products involving some 0-mean
independent random variable (this is exactly what we show in Equation (15)).
On the other hand, $A_i^b(c_1 - c_0)$ involves terms that are products of dependent
random variables, as can be seen by reference to Equation (10). We make frequent
use of Equation (12) in our derivations in this section and in Appendix A.2.

This technique requires a subroutine to compute the autocorrelation of a
trace:

Autocorrelate($T$)

    1: $F \leftarrow$ FFT($T$)
    2: for each $t \in \{0, \ldots, m - 1\}$:
    3:     $F(t) \leftarrow |F(t)|^2$
    4: return Inv-FFT($F$)

The $|F(t)|^2$ in line 3 is the squared $L_2$-norm of the complex number $F(t)$ (i.e.,
$|F(t)|^2 = F(t) \cdot \overline{F(t)}$, where $\bar{a}$ denotes the complex conjugate of $a$).

The subroutine FFT computes the usual Discrete Fourier Transform:

\[
(\text{FFT}(T))(x) = \sum_{j=0}^{m-1} T(j) \cdot \omega^{-xj} \tag{13}
\]

and Inv-FFT computes the Inverse Discrete Fourier Transform:

\[
(\text{Inv-FFT}(T))(y) = \frac{1}{m} \sum_{j=0}^{m-1} T(j) \cdot \omega^{yj} \tag{14}
\]

In the above equations, $\omega$ is a complex primitive $m$th root of unity (i.e., $\omega \in \mathbb{C}$,
$\omega^m = 1$, and $\omega^k \ne 1$ for all $0 < k < m$).

The subroutines FFT, Inv-FFT, and therefore Autocorrelate itself all run in
time $\Theta(m \log m)$.
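The Autocorrelate subroutine can be sketched with a small textbook radix-2 FFT. This is an illustrative implementation under the assumption that the trace length is a power of two (a general FFT would lift that restriction); the function names are choices made for the example, and `autocorrelate_direct` evaluates Equation (10) literally as a cross-check.

```python
import cmath

def fft(xs, sign=-1):
    """Unnormalized radix-2 DFT; sign=-1 is forward, sign=+1 is the inverse kernel."""
    m = len(xs)
    if m == 1:
        return [complex(xs[0])]
    even, odd = fft(xs[0::2], sign), fft(xs[1::2], sign)
    out = [0j] * m
    for k in range(m // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / m) * odd[k]
        out[k], out[k + m // 2] = even[k] + w, even[k] - w
    return out

def autocorrelate(trace):
    """Cyclic autocorrelation via FFT, squared magnitudes, and inverse FFT."""
    m = len(trace)
    power = [abs(f) ** 2 for f in fft(trace)]
    return [v.real / m for v in fft(power, sign=+1)]  # 1/m normalizes the inverse

def autocorrelate_direct(trace):
    """Literal O(m^2) evaluation of the cyclic sum, for comparison."""
    m = len(trace)
    return [sum(trace[j] * trace[(j + t) % m] for j in range(m)) for t in range(m)]

print([round(a, 6) for a in autocorrelate([1.0, 2.0, 3.0, 4.0])])
print(autocorrelate_direct([1.0, 2.0, 3.0, 4.0]))
```

The two functions should agree up to floating-point rounding, and both exhibit the symmetry $A(t) = A(m - t)$ noted earlier.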

We can now define the top-level FFT-2DPA algorithm:

FFT-2DPA($T^0$, $T^1$, $\tau$)

    for each $b \in \{0, 1\}$, $t \in \{0, \ldots, m - 1\}$:
        $Z^b(t) \leftarrow 0$
    for each $b \in \{0, 1\}$, $i \in \{0, \ldots, n - 1\}$:
        $A_i^b \leftarrow$ Autocorrelate($T_i^b$)
        for each $t \in \{0, \ldots, m - 1\}$:
            $Z^b(t) \leftarrow Z^b(t) + A_i^b(t)$
    return DPA($Z^0$, $Z^1$, $nd^2$)
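Putting the pieces together, a self-contained sketch of FFT-2DPA might look as follows. It repeats compact versions of the helper routines so that it runs on its own, assumes power-of-two trace lengths and a known (or guessed) `d`, and uses noise-free toy traces with hypothetical leakage times `c0 = 1` and `c1 = 4`; all names are illustrative.

```python
import cmath

def fft(xs, sign=-1):
    """Unnormalized radix-2 DFT; len(xs) must be a power of two."""
    m = len(xs)
    if m == 1:
        return [complex(xs[0])]
    even, odd = fft(xs[0::2], sign), fft(xs[1::2], sign)
    out = [0j] * m
    for k in range(m // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / m) * odd[k]
        out[k], out[k + m // 2] = even[k] + w, even[k] - w
    return out

def autocorrelate(trace):
    power = [abs(f) ** 2 for f in fft(trace)]
    return [v.real / len(trace) for v in fft(power, sign=+1)]

def dpa(group0, group1, tau):
    for t in range(len(group0[0])):
        s = sum(tr[t] for tr in group0) - sum(tr[t] for tr in group1)
        if abs(s) > tau:
            return 1
    return 0

def fft_2dpa(group0, group1, d):
    """Sum the autocorrelations of each group, then run DPA with threshold n*d^2."""
    n, m = len(group0), len(group0[0])
    totals = []
    for group in (group0, group1):
        z = [0.0] * m
        for trace in group:
            for t, a in enumerate(autocorrelate(trace)):
                z[t] += a
        totals.append(z)
    return dpa([totals[0]], [totals[1]], n * d * d)

def toy(mask, b, m=8, c0=1, c1=4, d=1.0):
    """Noise-free trace: d*mask at c0 and d*(-1)^b*mask at c1."""
    trace = [0.0] * m
    trace[c0] += d * mask
    trace[c1] += d * (-1) ** b * mask
    return trace

correct0 = [toy(+1, 0), toy(-1, 0), toy(+1, 0), toy(-1, 0)]
correct1 = [toy(+1, 1), toy(-1, 1), toy(+1, 1), toy(-1, 1)]
wrong0 = [toy(+1, 0), toy(-1, 1), toy(+1, 0), toy(-1, 1)]
wrong1 = [toy(+1, 1), toy(-1, 0), toy(+1, 1), toy(-1, 0)]
print(fft_2dpa(correct0, correct1, 1.0), fft_2dpa(wrong0, wrong1, 1.0))
```

For the correctly grouped toy traces the summed autocorrelations differ by $2nd^2$ at lag $c_1 - c_0 = 3$, while the mixed grouping shows no difference at any lag.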

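What makes this work? Assuming a correct grouping, the expected sums are:

- $t \ne c_1 - c_0$:

\[
\begin{aligned}
E\left[Z^b(t)\right] &= E\left[\sum_{i=0}^{n-1} \sum_{j=0}^{m-1} T_i^b(j) \cdot T_i^b(j + t)\right] \\
&= n \sum_{j=0}^{m-1} E\left[T_0^b(j) \cdot T_0^b(j + t)\right] \\
&= nE\left[\left[S_0^b(c_0 - t) + S_0^b(c_0 + t)\right] \cdot dR_0^b\right] \\
&\quad + nE\left[\left[S_0^b(c_1 - t) + S_0^b(c_1 + t)\right] \cdot d(-1)^b R_0^b\right] \\
&\quad + nmE\left[S_0^b(0) \cdot S_0^b(0 + t)\right] \\
&= 0
\end{aligned}
\tag{15}
\]

- $t = c_1 - c_0$:

\[
\begin{aligned}
E\left[Z^b(t)\right] &= E\left[\sum_{i=0}^{n-1} \sum_{j=0}^{m-1} T_i^b(j) \cdot T_i^b(j + t)\right] \\
&= n \sum_{j=0}^{m-1} E\left[T_0^b(j) \cdot T_0^b(j + t)\right] \\
&= nE\left[\left[S_0^b(c_0 - t) + S_0^b(c_1)\right] \cdot dR_0^b\right] \\
&\quad + nE\left[\left[S_0^b(c_0) + S_0^b(c_1 + t)\right] \cdot d(-1)^b R_0^b\right] \\
&\quad + nE\left[d^2 (R_0^b)^2 (-1)^b\right] \\
&\quad + nmE\left[S_0^b(0) \cdot S_0^b(0 + t)\right] \\
&= 0 + 0 + nE\left[d^2 (R_0^b)^2 (-1)^b\right] + 0 \\
&= nd^2 (-1)^b
\end{aligned}
\tag{16}
\]

So in a correct grouping, we have

\[
E\left[Z^0(t) - Z^1(t)\right] =
\begin{cases}
2nd^2 & \text{if } t = c_1 - c_0 \\
0 & \text{otherwise.}
\end{cases}
\tag{17}
\]

In incorrect groupings, however, $E[Z^0(t) - Z^1(t)] = 0$ for all $t \in \{0, \ldots, m - 1\}$.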

In Section A.2, we see that this distribution is closely approximated by a
Gaussian with standard deviation $\sigma = \sqrt{n(8d^2 + 2m)}$, so that an attacker who
wishes to use a threshold at least $k$ standard deviations away from the mean
needs $nd^2 \ge k\sigma$, that is, $n$ to be at least about $k^2 \cdot \frac{8d^2 + 2m}{d^4}$.

Note that the noise from the other samples contributes significantly to the
standard deviation at $Z^b(c_1 - c_0)$, so this attack would only be practical for
relatively short traces and a significant correlated bit influence (i.e., when $m$ is
small and $d$ is not much smaller than 1).
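The dependence on the trace length can be made concrete with a small calculation. The helper below is a hypothetical convenience function that assumes the threshold $nd^2$ used by FFT-2DPA and the standard deviation $\sqrt{n(8d^2 + 2m)}$ quoted above, and solves $nd^2 \ge k\sigma$ for $n$.

```python
import math

def fft_2dpa_traces_needed(d, k, m):
    """Smallest n (per group) with n*d^2 >= k*sqrt(n*(8*d^2 + 2*m))."""
    return math.ceil(k**2 * (8 * d**2 + 2 * m) / d**4)

for m in (100, 1000, 10000):
    print(m, fft_2dpa_traces_needed(d=1.0, k=3, m=m), fft_2dpa_traces_needed(d=0.5, k=3, m=m))
```

Under these assumptions the required number of traces grows roughly linearly in `m` and like $1/d^4$ in the signal strength, which is why short traces and a strong correlated influence are needed for the attack to be practical.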

The preprocessing in FFT-2DPA runs in time $O(nm \log m)$. After this preprocessing,
however, each of $l$ guessed groupings can be tested using DPA in
time $O(nm)$, for a total runtime of $O(nm \log m + nml)$, amortizing to $O(nml)$
if $l = \Omega(\log m)$. Again, when considering this runtime, it is important to keep
in mind that the number $n$ of required traces can be substantially larger than
would be necessary for first-order DPA, if a first-order attack were possible.



FFT and Known-Offset 2DPA. It might be very helpful in practice to use
the FFT in second-order power analysis attacks for attempting to determine the
offset of correlation. With a few traces, it could be possible to use an FFT to
find the offset $s$ of repeated computations, such as when the same function is
computed with the random bit $r$ at time $c_0$ and with the masked bit $r + b$
at time $c_0 + s$.

With even a few values of $s$ suggested by an FFT on these traces, a
Known-Offset 2DPA attack could be attempted, which could require far fewer
traces than straight FFT 2DPA since Known-Offset 2DPA suffers from less noise
amplification.

5 Conclusion 

We explored two second-order attacks that attempt to defeat masking while 
minimizing computation resource requirements in terms of space and time. 

The first, Zero-Offset 2DPA, works in the special situation where the masking 
bit and the masked bit are coincidentally correlated with the power consumption, 
either canceling out or contributing a bimodal component. It runs with almost no 
noticeable overhead over standard DPA, but the number of required power traces 
increases more quickly with the relative noise present in the power consumption. 

The second technique, FFT 2DPA, works in the more general situation where 
the attacker knows very little about the device being analyzed and suffers only 
logarithmic overhead in terms of runtime. On the other hand, it also requires 
many more power traces as the relative noise increases. 

In summary, we expect that Zero-Offset 2DPA and Known-Offset 2DPA can
be of some practical use, but FFT 2DPA probably suffers from too much noise
amplification to be generally effective. However, if the traces are fairly short and
the correlated bit influence fairly large, it can be effective.

A Noise Amplification

In this section, we attempt to characterize the distribution of the estimators 
that we use to distinguish the target distributions. In particular, we show that 
the estimators have near-Gaussian distributions and we calculate their standard 
deviations. 



A.1 Zero-Offset 2DPA

As in Section 4.1, we assume that the times of correlation are coincident, so
that $c_0 = c_1$. From this, we get that the distribution of the samples in a
correct grouping follows Equation (5):

$$
T_i^b(t) =
\begin{cases}
S_i^b(t) + dR_i^b + d(-1)^b R_i^b & \text{if } t = c_0 \\
S_i^b(t) & \text{otherwise.}
\end{cases}
\tag{18}
$$

The sum

$$
\sum_{i=0}^{n-1} [T_i^0(c_0)]^2 = \sum_{i=0}^{n-1} [S_i^0(c_0) + 2dR_i^0]^2 \tag{19}
$$

is then a $\chi^2(\nu, \delta^2)$-distributed random variable with $\nu = n$
degrees of freedom and non-centrality parameter
$\delta^2 = \sum_{i=0}^{n-1} (2dR_i^0)^2 = 4nd^2$. It has mean
$\nu + \delta^2 = 4nd^2 + n$ and standard deviation
$\sqrt{2(\nu + 2\delta^2)} = \sqrt{2(n + 8nd^2)} = \sqrt{n(16d^2 + 2)}$.

A common rule of thumb is that $\chi^2$-distributed random variables with over
thirty degrees of freedom are closely approximated by Gaussians. We expect
$n \gg 30$, so we say

$$
\sum_{i=0}^{n-1} [T_i^0(c_0)]^2 \sim N\big(4nd^2 + n, \sqrt{n(16d^2 + 2)}\big). \tag{20}
$$

Similarly, we obtain $\sum_{i=0}^{n-1} [T_i^1(c_0)]^2 \sim \chi^2(n, 0)$, which,
since $n \gg 30$, we approximate with

$$
\sum_{i=0}^{n-1} [T_i^1(c_0)]^2 \sim N\big(n, \sqrt{2n}\big). \tag{21}
$$

The difference of the summed squares is then

$$
\sum_{i=0}^{n-1} \big[(T_i^0(c_0))^2 - (T_i^1(c_0))^2\big] \sim N\big(4nd^2, \sqrt{n(16d^2 + 4)}\big). \tag{22}
$$
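The mean and standard deviation of the difference of summed squares can be checked numerically. The following is a minimal Monte Carlo sketch, assuming unit-variance Gaussian noise and a uniform $\pm 1$ mask as in the model above; the values of `n`, `d` and `trials` are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, trials = 1000, 0.5, 2000   # illustrative values only

# At time c0 the two leakage terms add up in group 0 (S + 2dR)
# and cancel in group 1 (S alone).
noise0 = rng.standard_normal((trials, n))
noise1 = rng.standard_normal((trials, n))
mask = rng.choice([-1.0, 1.0], size=(trials, n))
stat = np.sum((noise0 + 2 * d * mask) ** 2 - noise1 ** 2, axis=1)

predicted_mean = 4 * n * d ** 2
predicted_std = np.sqrt(n * (16 * d ** 2 + 4))
print(stat.mean(), predicted_mean)
print(stat.std(), predicted_std)
```

Each row of the arrays is one independent repetition of the whole experiment, so the empirical mean and spread of `stat` should land close to the two predicted values.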

A.2 FFT 2DPA

Recalling our discussion from Section 4.2, we want to examine the distribution
of

$$
Z^b(t) = \sum_{i=0}^{n-1} \sum_{j=0}^{m-1} T_i^b(j) \cdot T_i^b(j + t) \tag{23}
$$

when $t = c_1 - c_0$. Its standard deviation should dominate that of $Z^b(t')$
for $t' \neq c_1 - c_0$ (for simplicity, we assume $c_1 - c_0 \neq c_0 - c_1$).

In Section 4.2, we saw that $E[Z^b(t)] = nd^2(-1)^b$. We would now like to
calculate its standard deviation.

In the following, we liberally use the fact that

$$
\mathrm{Var}[X + Y] = \mathrm{Var}[X] + \mathrm{Var}[Y] + 2\,\mathrm{Cov}[X, Y], \tag{24}
$$

where $\mathrm{Cov}[X, Y]$ is the covariance of $X$ and $Y$
($\mathrm{Cov}[X, Y] = E[XY] - E[X]E[Y]$). We would often like to add variances
of random variables that are not independent; Equation (24) says we can do so
if the random variables have 0 covariance.
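Since the traces are independent and identically distributed,

$$
\begin{aligned}
\mathrm{Var}[Z^b(t)] &= \sum_{i=0}^{n-1} \mathrm{Var}\Big[\sum_{j=0}^{m-1} T_i^b(j) \cdot T_i^b(j + t)\Big] \\
&= n\,\mathrm{Var}\Big[\sum_{j=0}^{m-1} T_0^b(j) \cdot T_0^b(j + t)\Big] \\
&= n\,\mathrm{Var}\big[dR_0^b\big([S_0^b(c_0 - t) + S_0^b(c_1)] + (-1)^b [S_0^b(c_0) + S_0^b(c_1 + t)]\big)\big] \\
&\quad + n\,\mathrm{Var}\Big[\sum_{j=0}^{m-1} S_0^b(j) \cdot S_0^b(j + t)\Big],
\end{aligned}
\tag{25}
$$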



where we were able to split the variance in the last line since the two terms have
0 covariance.

To calculate
$\mathrm{Var}\big[dR_0^b\big([S_0^b(c_0 - t) + S_0^b(c_1)] + (-1)^b [S_0^b(c_0) + S_0^b(c_1 + t)]\big)\big]$,
note that its terms have 0 covariance. For example:

$$
\begin{aligned}
\mathrm{Cov}[dR_0^b S_0^b(c_0 - t), dR_0^b S_0^b(c_1)] &= E[(dR_0^b)^2 S_0^b(c_0 - t) \cdot S_0^b(c_1)] \\
&\quad - E[dR_0^b S_0^b(c_0 - t)]\,E[dR_0^b S_0^b(c_1)] \\
&= 0 - 0 = 0
\end{aligned}
\tag{26}
$$

since the expectation of a product involving an independent 0-mean random
variable is 0. Furthermore, it is easy to check that each term has the same
variance, and

$$
\begin{aligned}
\mathrm{Var}[dR_0^b S_0^b(c_1)] &= E[[dR_0^b S_0^b(c_1)]^2] - E[dR_0^b S_0^b(c_1)]^2 \\
&= d^2 E[[S_0^b(c_1)]^2] - 0 \\
&= d^2,
\end{aligned}
\tag{27}
$$

for a total contribution of

$$
\mathrm{Var}\big[dR_0^b\big([S_0^b(c_0 - t) + S_0^b(c_1)] + (-1)^b [S_0^b(c_0) + S_0^b(c_1 + t)]\big)\big] = 4d^2. \tag{28}
$$

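The calculation of $\mathrm{Var}\big[\sum_{j=0}^{m-1} S_0^b(j) \cdot S_0^b(j + t)\big]$
is similar since its terms also have covariance 0 and they all have the same
variance. Thus,

$$
\begin{aligned}
\mathrm{Var}\Big[\sum_{j=0}^{m-1} S_0^b(j) \cdot S_0^b(j + t)\Big] &= m\,\mathrm{Var}[S_0^b(0) S_0^b(0 + t)] \\
&= m\big(E[[S_0^b(0)]^2 [S_0^b(t)]^2] - E[S_0^b(0) S_0^b(t)]^2\big) \\
&= m(1 - 0) = m.
\end{aligned}
\tag{29}
$$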

Finally, plugging Equations (28) and (29) into Equation (25), we get the
result

$$
\mathrm{Var}[Z^b(t)] = n(m + 4d^2) \tag{30}
$$

and the corresponding standard deviation is $\sqrt{n(m + 4d^2)}$.

As in Section A.1, we expect $n$ to be large and we say

$$
Z^b(t) \sim N\big(nd^2(-1)^b, \sqrt{n(4d^2 + m)}\big). \tag{31}
$$

Finally, we get the distribution of the difference:

$$
Z^0(t) - Z^1(t) \sim N\big(2nd^2, \sqrt{n(8d^2 + 2m)}\big). \tag{32}
$$

Abstract. A classical model is used for the power consumption of
cryptographic devices. It is based on the Hamming distance of the data
handled with regard to an unknown but constant reference state. Once
validated experimentally it allows an optimal attack to be derived called
Correlation Power Analysis. It also explains the defects of former
approaches such as Differential Power Analysis.

Keywords: Correlation factor, CPA, DPA, Hamming distance, power

1 Introduction

In the scope of statistical power analysis against cryptographic devices, two 
historical trends can be observed. The first one is the well known differential 
power analysis (DPA) introduced by Paul Kocher [12,13] and formalized by 
Thomas Messerges et al. [16]. The second one has been suggested in various 
papers [8,14,18] and proposed to use the correlation factor between the power 
samples and the Hamming weight of the handled data. Both approaches exhibit 
some limitations due to unrealistic assumptions and model imperfections that 
will be examined more thoroughly in this paper. This work follows previous
studies aiming at either improving the Hamming weight model [2], or enhancing
the DPA itself by various means [6,4].

The proposed approach is based on the Hamming distance model which can
be seen as a generalization of the Hamming weight model. All its basic
assumptions were already mentioned in various papers from year 2000 [16,8,6,2].
But they remained allusions, offered as possible explanations of DPA defects,
and never led to any complete and convenient exploitation. Our experimental
work is a synthesis of those former approaches in order to give a full insight
into the data leakage. Following [8,14,18] we propose to use the correlation
power analysis (CPA) to identify the parameters of the leakage model. Then we
show that sound and efficient attacks can be conducted against unprotected
implementations of many algorithms such as DES or AES. This study deliberately
restricts itself to the scope of secret key cryptography although it may be
extended beyond.

This paper is organized as follows: Section 2 introduces the Hamming
distance model and Section 3 proves the relevance of the correlation factor. The
model based correlation attack is described in Section 4 along with the impact
of model errors. Section 5 addresses the estimation problem and the experimental
results which validate the model are presented in Section 6. Section 7 contains
the comparative study with DPA and addresses more specifically the so-called
“ghost peaks” problem encountered by those who have to deal with erroneous
conclusions when implementing classical DPA on the substitution boxes of the
DES first round: it is shown there how the proposed model explains many defects
of the DPA and how the correlation power analysis can help in conducting sound
attacks in optimal conditions. Our conclusion summarizes the advantages and
drawbacks of CPA versus DPA and recalls that countermeasures work against
both methods alike.

2 The Hamming Distance Consumption Model 

Classically, most power analyses found in literature are based upon the Hamming
weight model [13,16], that is the number of bits set in a data word. In an
$m$-bit microprocessor, binary data is coded $D = \sum_{j=0}^{m-1} d_j 2^j$,
with the bit values $d_j = 0$ or 1. Its Hamming weight is simply the number of
bits set to 1, $H(D) = \sum_{j=0}^{m-1} d_j$. Its integer values stand between 0
and $m$. If $D$ contains $m$ independent and uniformly distributed bits, the
whole word has an average Hamming weight $\mu_H = m/2$ and a variance
$\sigma_H^2 = m/4$.
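The mean and variance quoted above are easy to confirm by enumeration. The sketch below takes $m = 8$ as an example and lists every $m$-bit word exactly once, which is equivalent to drawing the bits independently and uniformly; the helper name `hamming_weight` is chosen here for illustration.

```python
from statistics import mean, pvariance

def hamming_weight(word):
    """Number of bits set to 1 in a non-negative integer."""
    return bin(word).count("1")

m = 8
weights = [hamming_weight(d) for d in range(2 ** m)]

# Population mean and variance over all 2^m equally likely words.
print(mean(weights), pvariance(weights))
print(m / 2, m / 4)
```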

It is generally assumed that the data leakage through the power side-channel 
depends on the number of bits switching from one state to the other [6,8] at a 
given time. A microprocessor is modeled as a state-machine where transitions 
from state to state are triggered by events such as the edges of a clock signal. 
This seems relevant when looking at a logical elementary gate as implemented in 
CMOS technology. The current consumed is related to the energy required to flip 
the bits from one state to the next. It is composed of two main contributions: the 
capacitor’s charge and the short circuit induced by the gate transition. Curiously, 
this elementary behavior is commonly admitted but has never given rise to any 
satisfactory model that is widely applicable. Only hardware designers are famil- 
iar with simulation tools to foresee the current consumption of microelectronic 
devices. 

If the transition model is adopted, a basic question is posed: what is the refer- 
ence state from which the bits are switched? We assume here that this reference 
state is a constant machine word, R, which is unknown, but not necessarily 
zero. It will always be the same if the same data manipulation always occurs at 
the same time, although this assumes the absence of any desynchronizing effect. 
Moreover, it is assumed that switching a bit from 0 to 1 or from 1 to 0 requires 
the same amount of energy and that all the machine bits handled at a given 
time are perfectly balanced and consume the same. 

These restrictive assumptions are quite realistic and affordable without any
thorough knowledge of microelectronic devices. They lead to a convenient
expression for the leakage model. Indeed the number of flipping bits to go from
$R$ to $D$ is described by $H(D \oplus R)$, also called the Hamming distance
between $D$ and $R$. This statement encloses the Hamming weight model which
assumes that $R = 0$. If $D$ is a uniform random variable, so is $D \oplus R$,
and $H(D \oplus R)$ has the same mean $m/2$ and variance $m/4$ as $H(D)$.

We also assume a linear relationship between the current consumption and
$H(D \oplus R)$. This can be seen as a limitation but considering a chip as a
large set of elementary electrical components, this linear model fits reality
quite well. It does not represent the entire consumption of a chip but only the
data dependent part. This does not seem unrealistic because the bus lines are
usually considered as the most consuming elements within a micro-controller.
All the remaining things in the power consumption of a chip are assigned to a
term denoted $b$ which is assumed independent from the other variables: $b$
encloses offsets, time dependent components and noise. Therefore the basic
model for the data dependency can be written:

$$
W = aH(D \oplus R) + b
$$

where $a$ is a scalar gain between the Hamming distance and $W$ the power
consumed.

3 The Linear Correlation Factor 

A linear model implies some relationships between the variances of the different
terms considered as random variables: $\sigma_W^2 = a^2 \sigma_H^2 + \sigma_b^2$.
Classical statistics introduce the correlation factor $\rho_{WH}$ between the
Hamming distance and the measured power to assess the linear model fitting rate.
It is the covariance between both random variables normalized by the product of
their standard deviations. Under the uncorrelated noise assumption, this
definition leads to:

$$
\rho_{WH} = \frac{\mathrm{cov}(W, H)}{\sigma_W \sigma_H}
= \frac{a \sigma_H}{\sigma_W}
= \frac{a \sigma_H}{\sqrt{a^2 \sigma_H^2 + \sigma_b^2}}
= \frac{a \sqrt{m}}{\sqrt{m a^2 + 4 \sigma_b^2}}
$$

This equation complies with the well known property $-1 \le \rho_{WH} \le +1$:
for a perfect model the correlation factor tends to $\pm 1$ if the variance of
noise tends to 0, the sign depending on the sign of the linear gain $a$. If the
model applies only to $l$ independent bits amongst $m$, a partial correlation
still exists:

$$
\rho_{WH_{l/m}} = \frac{a \sqrt{l}}{\sqrt{m a^2 + 4 \sigma_b^2}} = \rho_{WH} \sqrt{\frac{l}{m}}
$$
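Both correlation formulas can be compared with a simulation of the linear model. This is a sketch under assumptions made here: an 8-bit word, Gaussian noise for the term $b$, and illustrative values for the gain, the noise level and the reference state; the partial model predicts only the two low-order bits.

```python
import numpy as np

HW = np.array([bin(v).count("1") for v in range(256)])  # Hamming weights of all bytes

rng = np.random.default_rng(2)
m, a, sigma_b, ref_state = 8, 1.0, 2.0, 0xB7   # illustrative values only
data = rng.integers(0, 256, size=200_000)      # uniformly distributed data words D
hd = HW[data ^ ref_state]                      # H(D xor R)
power = a * hd + rng.normal(0.0, sigma_b, size=data.size)   # W = a*H + b

rho_full = np.corrcoef(power, hd)[0, 1]
rho_theory = a * np.sqrt(m) / np.sqrt(m * a ** 2 + 4 * sigma_b ** 2)

# Partial model: predict only the l low-order bits of the word.
l = 2
hd_partial = HW[(data ^ ref_state) & (2 ** l - 1)]
rho_partial = np.corrcoef(power, hd_partial)[0, 1]

print(rho_full, rho_theory)
print(rho_partial, rho_theory * np.sqrt(l / m))
```

The unpredicted bits behave like extra noise for the partial model, which is why its correlation drops by the factor $\sqrt{l/m}$ rather than vanishing.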

4 Secret Inference Based on Correlation Power Analysis 

The relationships written above show that if the model is valid the correlation
factor is maximized when the noise variance is minimum. This means that
$\rho_{WH}$ can help to determine the reference state $R$. Assume, just like in
DPA, that a set of known but randomly varying data $D$ and a set of related
power consumption $W$ are available. If the $2^m$ possible values of $R$ are
scanned exhaustively they can be ranked by the correlation factor they produce
when combined with the observation $W$. This is not that expensive when
considering an 8-bit micro-controller, the case with many of today’s smart
cards, as only 256 values are to be tested. On 32-bit architectures this
exhaustive search cannot be applied as such. But it is still possible to work
with partial correlation or to introduce prior knowledge.

Let $R$ be the true reference and $H = H(D \oplus R)$ the right prediction on
the Hamming distance. Let $R'$ represent a candidate value and $H'$ the related
model $H' = H(D \oplus R')$. Assume a value of $R'$ that has $k$ bits that
differ from those of $R$, then: $H(R \oplus R') = k$. Since $b$ is independent
from other variables, the correlation test leads to (see [5]):

$$
\rho_{WH'} = \frac{\mathrm{cov}(aH + b, H')}{\sigma_W \sigma_{H'}}
= \frac{a\,\mathrm{cov}(H, H')}{\sigma_W \sigma_{H'}}
= \rho_{WH}\, \rho_{HH'}
= \rho_{WH} \frac{m - 2k}{m}
$$

This formula shows how the correlation factor is capable of rejecting wrong
candidates for $R$. For instance, if a single bit is wrong amongst an 8-bit
word, the correlation is reduced by 1/4. If all the bits are wrong, i.e.
$R' = \neg R$, then an anti-correlation should be observed with
$\rho_{WH'} = -\rho_{WH}$. In absolute value or if the linear gain is assumed
positive ($a > 0$), there cannot be any $R'$ leading to a higher correlation
rate than $R$. This proves the uniqueness of the solution and therefore how the
reference state can be determined.
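The factor $(m - 2k)/m$ can be verified exactly on an 8-bit word by enumerating every data value once. The sketch below is illustrative only: the reference state and the XOR masks used to build wrong candidates are arbitrary choices.

```python
import numpy as np

HW = np.array([bin(v).count("1") for v in range(256)])  # Hamming weights of all bytes
m = 8
data = np.arange(2 ** m)     # every 8-bit word once, i.e. exactly uniform data
true_ref = 0xB7              # illustrative reference state

def model_correlation(candidate):
    """Correlation between H(D xor R) and H(D xor R') over all data words D."""
    return np.corrcoef(HW[data ^ true_ref], HW[data ^ candidate])[0, 1]

for flipped in (0x00, 0x01, 0x0F, 0xFF):   # masks with k = 0, 1, 4 and 8 set bits
    k = int(HW[flipped])
    print(k, model_correlation(true_ref ^ flipped), (m - 2 * k) / m)
```

A candidate with half of its bits wrong is uncorrelated with the true model, and the bitwise complement is perfectly anti-correlated.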

This analysis can be performed on the power trace assigned to a piece of
code while manipulating known and varying data. If we assume that the handled
data is the result of a XOR operation between a secret key word $K$ and a
known message word $M$, $D = K \oplus M$, the procedure described above, i.e.
exhaustive search on $R$ and correlation test, should lead to $K \oplus R$
associated with $\max(\rho_{WH})$. Indeed if a correlation occurs when $M$ is
handled with respect to $R_1$, another has to occur later on, when $M \oplus K$
is manipulated in turn, possibly with a different reference state $R_2$ (in
fact with $K \oplus R_2$ since only $M$ is known).

For instance, when considering the first AddRoundKey function at the beginning
of the AES algorithm embedded on an 8-bit processor, it is obvious that such a
method leads to the whole key masked by the constant reference byte $R_2$. If
$R_2$ is the same for all the key bytes, which is highly plausible, only $2^8$
possibilities remain to be tested by exhaustive search to infer the entire key
material. This complementary brute force may be avoided if $R_2$ is determined
by other means or known to be always equal to 0 (on certain chips).
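The exhaustive search over the 256 candidate bytes can be sketched on simulated data. Everything specific here is hypothetical: the key byte, the reference state, the unit gain and the Gaussian noise level are illustrative values, and `rank_candidates` is a helper named for this example. As the text explains, the search returns $K \oplus R$, not $K$ itself.

```python
import numpy as np

HW = np.array([bin(v).count("1") for v in range(256)])  # Hamming weights of all bytes

def rank_candidates(messages, power):
    """Correlation of the power samples with H(M xor c) for every byte c."""
    return np.array([np.corrcoef(power, HW[messages ^ c])[0, 1] for c in range(256)])

rng = np.random.default_rng(3)
key, ref_state = 0x2B, 0xB7                  # hypothetical secret byte and reference state
messages = rng.integers(0, 256, size=500)    # known, randomly varying message bytes
power = HW[messages ^ key ^ ref_state] + rng.normal(0.0, 1.0, size=messages.size)

scores = rank_candidates(messages, power)
best = int(np.argmax(scores))
print(hex(best), hex(key ^ ref_state))
```

The candidate with the most negative score should be the bitwise complement of the best one, matching the anti-correlation noted above.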

This attack is not restricted to the $\oplus$ operation. It also applies to many
other operators often encountered in secret key cryptography. For instance, other
arithmetic, logical operations or look-up tables (LUT) can be treated in the
same manner by using $H(LUT(M * K) \oplus R)$, where $*$ represents the involved
function, i.e. $\oplus$, +, -, OR, AND, or whatever operation. Note that the
ambiguity between $K$ and $K \oplus R$ is completely removed by the substitution
boxes encountered in secret key algorithms thanks to the non-linearity of the
corresponding LUT: this may require exhausting both $K$ and $R$, but only once
for $R$ in most cases. To conduct an analysis in the best conditions, we emphasize
the benefit of correctly modeling the whole machine word that is actually handled
and its transition with respect to the reference state $R$ which is to be determined
as an unknown of the problem.



5 Estimation 



In a real case with a set of $N$ power curves $W_i$ and $N$ associated random
data words $M_i$, for a given reference state $R$ the known data words produce
a set of $N$ predicted Hamming distances $H_{i,R} = H(M_i \oplus R)$. An
estimate $\hat\rho_{WH}$ of the correlation factor $\rho_{WH}$ is given by the
following formula:

$$
\hat\rho_{WH}(R) = \frac{N \sum W_i H_{i,R} - \sum W_i \sum H_{i,R}}
{\sqrt{N \sum W_i^2 - (\sum W_i)^2}\, \sqrt{N \sum H_{i,R}^2 - (\sum H_{i,R})^2}}
$$

where the summations are taken over the $N$ samples ($i = 1, \dots, N$) at each
time step within the power traces $W_i(t)$.
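The estimator can be written out directly from its sums. The sketch below applies it at a single time step and compares it with NumPy's `corrcoef`, which computes the same sample correlation; the simulated Hamming distances, gain and noise level are stand-in values chosen for illustration.

```python
import numpy as np

def estimate_rho(power, predicted_hd):
    """Sample correlation written with the explicit sums of the estimator."""
    w = np.asarray(power, dtype=float)
    h = np.asarray(predicted_hd, dtype=float)
    n = w.size
    numerator = n * np.sum(w * h) - np.sum(w) * np.sum(h)
    denominator = (np.sqrt(n * np.sum(w ** 2) - np.sum(w) ** 2)
                   * np.sqrt(n * np.sum(h ** 2) - np.sum(h) ** 2))
    return numerator / denominator

rng = np.random.default_rng(4)
hd = rng.integers(0, 9, size=300)                     # stand-in Hamming distances in 0..8
power = 0.7 * hd + rng.normal(0.0, 1.5, size=hd.size) # one power sample per curve
print(estimate_rho(power, hd), np.corrcoef(power, hd)[0, 1])
```

In a full analysis the same computation would be repeated for every time step of the traces and for every candidate reference state.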

It is theoretically difficult to compute the variance of the estimator
$\hat\rho_{WH}$ with respect to the number of available samples $N$. In practice
a few hundred experiments suffice to provide a workable estimate of the
correlation factor. $N$ has to be increased with the model variance $m/4$
(higher on a 32-bit architecture) and, obviously, with the level of measurement
noise. The results that follow show that this is more than enough for
conducting reliable tests. The reader is referred to [5] for further discussion
about the estimation on experimental data and optimality issues. It is shown
that this approach can be seen as a maximum likelihood model fitting procedure
when $R$ is searched exhaustively to maximize $\rho_{WH}$.



6 Experimental Results 

This section aims at confronting the leakage model with real experiments. General
rules of behavior are derived from the analysis of various chips for secure devices
conducted during the past years.

Our first experiment was performed on a basic XOR algorithm implemented
in an 8-bit chip known for leaking information (more suitable for didactic
purposes). The sequence of instructions was simply the following:

- load a byte $D_1$ into the accumulator
- XOR $D_1$ with a constant $D_2$
- store the result from the accumulator to a destination memory cell.

The program was executed 256 times with $D_1$ varying from 0 to 255. As
displayed in Figure 1, two significant correlation peaks were obtained with two
different reference states: the first one being the address of $D_1$, the second
one the opcode of the XOR instruction. These curves bring the experimental
evidence of leakage principles that previous works just hint at, without going
into more detail [16,8,6,17]. They illustrate the most general case of a
transfer sequence on a common bus. The address of a data word is transmitted
just before its value that is in turn immediately followed by the opcode of the
next instruction which is fetched. Such a behavior can be observed on a wide
variety of chips even those implementing 16 or 32-bit architectures. Correlation
rates ranging from 60% to more than 90% can often be obtained. Figure 2 shows an
example of partial correlation on a 32-bit architecture: when only 4 bits are
predicted among 32, the correlation loss is in about the ratio $\sqrt{8}$ which
is consistent with the displayed correlations.

Fig. 1. Upper: consecutive correlation peaks for two different reference states. Lower:
for varying data (0-255), model array and measurement array taken at the time of the
second correlation peak.

This sort of result can be observed on various technologies and implementations.
Nevertheless the following restrictions have to be mentioned:

- Sometimes the reference state is systematically 0. This can be assigned to the
  so-called pre-charged logic where the bus is cleared between each transferred
  value. Another possible reason is that complex architectures implement
  separated busses for data and addresses, that may prohibit certain transitions.
  In all those cases the Hamming weight model is recovered as a particular
  case of the more general Hamming distance model.

- The sequence of correlation peaks may sometimes be blurred or spread over
  time in the presence of a pipeline.

- Some recent technologies implement hardware security features designed to
  impede statistical power analysis. These countermeasures offer various levels
  of efficiency, going from the most naive and easy to bypass, to the most
  effective which simply cancel any data dependency.

Fig. 2. Two correlation peaks for full word (32 bits) and partial (4 bits) predictions.
According to theory the 20% peak should rather be around 26%.

There are different kinds of countermeasures, which are essentially the same as
those designed against DPA.

- Some of them consist in introducing desynchronization in the execution of
  the process so that the curves are not aligned anymore within a same
  acquisition set. For that purpose there exist various techniques such as fake
  cycles insertion, unstable clocking or random delays [6,18]. In certain cases
  their effect can be corrected by applying appropriate signal processing.

- Other countermeasures consist in blurring the power traces with additional
  noise or filtering circuitry [19]. Sometimes they can be bypassed by curves
  selection and/or averaging or by using another side channel such as
  electromagnetic radiation [9,1].

- The data can also be ciphered dynamically during a process by hardware
  (such as bus encryption) or software means (data masking with a random
  [11,7,20,10]), so that the handled variables become unpredictable: then no
  correlation can be expected anymore. In theory sophisticated attacks such
  as higher order analysis [15] can overcome the data masking method; but
  they are easy to thwart in practice by using desynchronization for instance.

Indeed, if implemented alone, none of these countermeasures can be considered
as absolutely secure against statistical analyses. They just increase the amount
of effort and level of expertise required to achieve an attack. However combined
defenses, implementing at least two of these countermeasures, prove to be very
efficient and practically dissuasive. The state of the art of countermeasures in
the design of tamper resistant devices has made big advances in recent years.
It is now admitted that security requirements include sound implementations as
much as robust cryptographic schemes.

7 Comparison with DPA 

This section addresses the comparison of the proposed CPA method with
Differential Power Analysis (DPA). It refers to the former works done by Messerges
et al. [16,17] who formalized the ideas previously suggested by Kocher [12,13].
A critical study is proposed in [5].



7.1 Practical Problems with DPA: The “Ghost Peaks” 

We just consider hereafter the practical implementation of DPA against the DES 
substitutions (1st round). In fact this well-known attack works quite well only 
if the following assumptions are fulfilled: 

1. Word space assumption: within the word hosting the predicted bit, the con- 
tribution of the non-targeted bits is independent of the targeted bit value. 
Their average influence in the curves pack of 0 is the same as that in the 
curves pack of 1. So the attacker does not need to care about these bits. 

2. Guess space assumption: the predicted value of the targeted bit for any 
wrong sub-key guess does not depend on the value associated to the correct 
guess. 

3. Time space assumption: the power consumption W does not depend on the 
value of the targeted bit except when it is explicitly handled. 

But when confronted with experiment, the attack comes up against the
following facts.

- Fact A. For the correct guess, DPA peaks appear also when the targeted
  bit is not explicitly handled. This is worth noticing, though not really
  troublesome. However this contradicts the third assumption.

- Fact B. Some DPA peaks also appear for wrong guesses: they are called
  “ghost peaks”. This fact is more problematic for making a sound decision
  and contradicts the second assumption.

- Fact C. The true DPA peak given by the right guess may be smaller than
  some ghost peaks, and even null or negative! This seems somewhat amazing
  and quite confusing for an attacker. The reasons must be sought in
  the crudeness of the optimistic first assumption.



7.2 The “Ghost Peaks” Explanation 

With the help of a thorough analysis of substitution boxes and the Hamming
distance model it is now possible to explain the observed facts and show how
wrong the basic assumptions of DPA can be.

Fact A. As a matter of fact some data handled along the algorithm may be
partially correlated with the targeted bit. This is not that surprising when looking
at the structure of the DES. A bit taken from the output nibble of a SBox has
a lifetime lasting at least until the end of the round (and beyond if the left part
of the IP output does not vary too much). A DPA peak rises each time this bit
and its 3 peer bits undergo the following P permutation since they all belong to
the same machine word.

Fact B. The reason why wrong guesses may generate DPA peaks is that the
distributions of an SBox output bit for two different guesses are deterministic
and so possibly partially correlated. The following example is very convincing
about that point. Let’s consider the leftmost bit of the fifth SBox of the DES
when the input data $D$ varies from 0 to 63 and is combined with two different
sub-keys: MSB(SBox5($D \oplus$ 0x00)) and MSB(SBox5($D \oplus$ 0x36)). Both
series of bits are respectively listed hereafter, with their bitwise XOR on the
third line:

1101101010010110001001011001001110101001011011010101001000101101

1001101011010110001001011101001010101101011010010101001000111001

0100000001000000000000000100000100000100000001000000000000010100

The third line contains 8 set bits, revealing only eight errors of prediction among
64. This example shows that a wrong guess, say 0, can provide a good prediction
at a rate of 56/64, that is not that far from the correct one 0x36. The result would
be equivalent for any other pair of sub-keys $K$ and $K \oplus$ 0x36. Consequently a
substantial concurrent DPA peak will appear at the same location as the right
one. The weakness of the contrast will disturb the guesses ranking especially in
presence of high SNR.
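The count of prediction errors can be reproduced directly from the two bit series listed above. The sketch works on the printed strings only and does not recompute the S-box outputs; the variable names are chosen here for illustration.

```python
# Predicted selection bits for the two sub-key guesses, copied from the text.
guess_00 = "1101101010010110001001011001001110101001011011010101001000101101"
guess_36 = "1001101011010110001001011101001010101101011010010101001000111001"

# Bitwise XOR of the two series: a 1 marks an input where the predictions differ.
xor_line = "".join("1" if a != b else "0" for a, b in zip(guess_00, guess_36))
mismatches = xor_line.count("1")
agreement = (len(xor_line) - mismatches) / len(xor_line)

print(xor_line)
print(mismatches, agreement)
```

An agreement rate of 56/64 means the two guesses sort almost the same curves into the same DPA packs, which is why their peaks are hard to tell apart.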

Fact C. DPA implicitly considers the word bits carried along with the targeted
bit as uniformly distributed and independent from the targeted one. This is
erroneous because implementation introduces a deterministic link between their
values. Their asymmetric contribution may affect the height and sign of a DPA
peak. This may influence the analysis on the one hand by shrinking relevant
peaks, on the other hand by enhancing meaningless ones. There exists a well
known trick to bypass this difficulty as mentioned in [4]. It consists in shifting
the DPA attacks a little bit further in the processing and performing the prediction
just after the end of the first round when the right part of the data (32 bits) is
XORed with the left part of the IP output. As the message is chosen freely, this
represents an opportunity to re-balance the loss of randomness by bringing new
refreshed random data. But this does not fix Fact B in a general case.

To get rid of these ambiguities the model based approach aims at taking
the whole information into account. This requires introducing the notion of
algorithmic implementation, which the DPA assumptions completely obscure.

When considering the substitution boxes of the DES, it must be recalled
that the output values are 4-bit values. Although these 4 bits are in
principle equivalent as DPA selection bits, they live together with 4 other bits in
the context of an 8-bit microprocessor. Efficient implementations usually exploit
those 4 bits to save some storage space in constrained environments like smart
card chips. A trick referred to as “SBox compression” consists in storing 2 SBox
values within a same byte. Thus the required space is halved. There are different
ways to implement this. Let’s consider for instance the first 2 boxes: instead of
allocating 2 different arrays, it is more efficient to build up the following
look-up table: $LUT_{12}(k) = SBox_1(k) \,\|\, SBox_2(k)$. For a given input index
$k$, the array byte contains the values of two neighboring boxes. Then according
to the Hamming distance consumption model, the power trace should vary like:

- $H(LUT_{12}(D_1 \oplus K_1) \oplus R_1)$ when computing $SBox_1$.

- $H(LUT_{12}(D_2 \oplus K_2) \oplus R_2)$ when computing $SBox_2$.
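The packing and the resulting leakage prediction can be sketched as follows. The two tables are toy 4-bit tables generated at random as stand-ins, not the real DES S-boxes, and placing the first box in the high nibble is an assumption made here; `predicted_leakage` is a helper named for this example.

```python
import random

def hamming_weight(value):
    """Number of bits set to 1 in a non-negative integer."""
    return bin(value).count("1")

# Toy 4-bit tables standing in for SBox1 and SBox2 (NOT the real DES S-boxes).
rng = random.Random(5)
toy_sbox1 = [rng.randrange(16) for _ in range(64)]
toy_sbox2 = [rng.randrange(16) for _ in range(64)]

# "SBox compression": one byte per index, first box in the high nibble (assumed),
# second box in the low nibble.
lut12 = [(s1 << 4) | s2 for s1, s2 in zip(toy_sbox1, toy_sbox2)]

def predicted_leakage(data, subkey, ref_state):
    """H(LUT12(D xor K) xor R): the whole byte leaks, not just the wanted nibble."""
    return hamming_weight(lut12[data ^ subkey] ^ ref_state)

k = 24
print((lut12[k] >> 4) == toy_sbox1[k], (lut12[k] & 0x0F) == toy_sbox2[k])
print(predicted_leakage(data=13, subkey=24, ref_state=0xB7))
```

Because the byte fetched for one box always carries the neighboring box's value at the same index, the four unwanted bits are a deterministic function of the same input rather than independent noise.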

If the values are bound like this, their respective bits cannot be considered as
independent anymore. To prove this assertion we have conducted an experiment
on a real 8-bit implementation that was not protected by any DPA
countermeasures. Working in a “white box” mode, the model parameters had been
previously calibrated with respect to the measured consumption traces. The reference
state $R$ = 0xB7 had been identified as the opcode of an instruction transferring
the content of the accumulator to RAM using direct addressing. The model fitted
the experimental data samples quite well; their correlation factor even reached
97%. So we were able to simulate the real consumption of the SBox output with
a high accuracy. Then the study consisted in applying a classical single bit DPA
to the output of SBox1 in parallel on both sets of 200 data samples: the measured
and the simulated power consumptions.

As figure 3 shows, the simulated and experimental DPA biases match par- 
ticularly well. One can notice the following points: 

— The 4 output bits are far from being equivalent. 

— The polarity of the peak associated to the correct guess 24 depends on the 
polarity of the reference state. As R = 0xB7 its leftmost nibble aligned with 
SBox! is 0 xB = ’1011’ and only the selection bit 2 (counted from the left) 
results in a positive peak whereas the 3 others undergo a transition from 1 
to 0, leading to a negative peak. 

— In addition this bit is a somewhat lucky bit because when it is used as selec- 
tion bit only guess 50 competes with the right sub-key. This is a particular 
favorable case occurring here on SBox!, partly due to the set of 200 used 
messages. It cannot be extrapolated to other boxes. 

— The dispersion of the DPA bias over the guesses is quite confused (see bit 4).

The quality of the modeling proves that those facts cannot be blamed on the
number of acquisitions. Increasing it well beyond 200 does not help: the level
of the peaks with respect to the guesses does not evolve and converges to the
same ranking. This particular counter-example proves that the ambiguity of DPA
does not lie in imperfect estimation but in wrong basic hypotheses.
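
The polarity argument above can be checked on a toy model. The following Python sketch is only an illustration under stated assumptions: it uses an arbitrary 4-bit permutation as a stand-in S-box (not a DES table), a hypothetical sub-key, noise-free Hamming distance leakage, and all 16 possible inputs rather than 200 measured messages. With the reference nibble 0xB = '1011', only selection bit 2 gives a positive bias for the correct guess.

```python
# Toy single-bit DPA under a Hamming distance leakage model.
# SBOX4 is an arbitrary 4-bit permutation chosen for this sketch; it is NOT a DES S-box.
SBOX4 = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
         0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]


def hamming_weight(x: int) -> int:
    return bin(x).count("1")


def leakage(data: int, key: int, ref: int) -> int:
    """Noise-free Hamming distance between the S-box output and the reference state."""
    return hamming_weight(SBOX4[data ^ key] ^ ref)


def dpa_bias(data, traces, guess: int, bit_from_left: int) -> float:
    """Difference of means between traces whose predicted selection bit is 1 and 0."""
    shift = 4 - bit_from_left  # bit 1 is the leftmost (most significant) bit
    ones, zeros = [], []
    for d, t in zip(data, traces):
        selected = (SBOX4[d ^ guess] >> shift) & 1
        (ones if selected else zeros).append(t)
    return sum(ones) / len(ones) - sum(zeros) / len(zeros)


REF = 0xB                 # '1011', the nibble of R = 0xB7 quoted in the text
KEY = 0x9                 # hypothetical 4-bit sub-key (assumption)
DATA = list(range(16))    # every possible 4-bit input, so output bits are balanced
TRACES = [leakage(d, KEY, REF) for d in DATA]

# Bias of the correct guess for selection bits 1..4 (counted from the left).
BIASES = [dpa_bias(DATA, TRACES, KEY, b) for b in (1, 2, 3, 4)]
print(BIASES)  # -> [-1.0, 1.0, -1.0, -1.0]
```

Because the stand-in S-box is a permutation and every input is used, the other three output bits average out exactly, which is why the bias is exactly +1 or -1 here. On real traces with a limited message set the peaks are noisier, as Fig. 3 shows.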

Fig. 3. DPA biases on SBox1 versus guesses for selection bits 1, 2, 3 and 4, on modeled
and experimental data; the correct guess is 24.



7.3 Results of Model Based CPA 

For comparison, the table hereafter provides the ranking of the first 6 guesses
sorted by decreasing correlation rates. This result is obtained with as few as
40 curves! The full key is 11 22 33 44 55 66 77 88 in hexadecimal format
and the corresponding sub-keys at the first round are 24, 19, 8, 8, 5, 50, 43, 2 in
decimal representation.



  SBox1     SBox2     SBox3     SBox4     SBox5     SBox6     SBox7     SBox8
  K ρmax    K ρmax    K ρmax    K ρmax    K ρmax    K ρmax    K ρmax    K ρmax
 24  92%   19  90%    8  87%    8  88%    5  91%   50  92%   43  89%    2  89%
 00  74%   18  77%   18  69%   44  67%   32  71%   25  71%   42  76%   28  77%
 01  74%   57  70%   05  68%   49  67%   25  70%   05  70%   52  70%   61  76%
 33  74%   02  70%   22  66%   02  66%   34  69%   54  70%   38  69%   41  72%
 15  74%   12  68%   58  66%   29  66%   61  67%   29  69%    0  69%   37  70%
 06  74%   13  67%   43  65%   37  65%   37  67%   53  67%   30  68%   15  69%



This table shows that the correct guess always stands out with a good contrast. 
Therefore a sound decision can be made without any ambiguity despite a rough 
estimation of ρmax.
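
The ranking of guesses by correlation can be sketched as follows. This is a minimal illustration, not the setup used for the table: the 4-bit stand-in S-box, the sub-key, and the noise-free simulated traces are all assumptions, and the Pearson correlation is computed by hand to keep the example self-contained.

```python
import math

# Arbitrary 4-bit permutation used as a stand-in S-box (assumption, NOT a DES table).
SBOX4 = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
         0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]


def hamming_weight(x: int) -> int:
    return bin(x).count("1")


def pearson(xs, ys) -> float:
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def cpa_ranking(data, traces, ref: int):
    """Return (guess, correlation) pairs sorted by decreasing correlation."""
    scores = []
    for guess in range(16):
        # Hamming distance model for this guess, relative to the reference state.
        model = [hamming_weight(SBOX4[d ^ guess] ^ ref) for d in data]
        scores.append((guess, pearson(model, traces)))
    return sorted(scores, key=lambda s: s[1], reverse=True)


REF = 0xB               # reference state nibble (assumption)
KEY = 0x9               # hypothetical sub-key to recover
DATA = list(range(16))
TRACES = [hamming_weight(SBOX4[d ^ KEY] ^ REF) for d in DATA]  # simulated, noise-free

RANKING = cpa_ranking(DATA, TRACES, REF)
for guess, rho in RANKING[:6]:
    print(f"K = {guess:2d}  rho = {rho:6.1%}")
```

With noise-free simulated traces the correct guess reaches a correlation of 100%. Measured traces carry noise and unmodeled activity, so the maximum stays below that, as in the table above.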

A similar attack has also been conducted on a 32-bit implementation, in a
white box mode with a perfect knowledge of the implemented substitution tables
and the reference state which was 0. The key was 7C A1 10 45 4A 1A 6E 57 in
hexadecimal format and the related sub-keys at the 1st round were 28, 12, 43,
0, 15, 60, 5, 38 in decimal representation. The number of curves is 100. As the
next table shows, the contrast is good between the correct and the most
competing wrong guess (around 40% on boxes 1 to 4). The correlation rate is not
that high on boxes 5 to 8, definitely because of partial and imperfect modeling,
but it remains exploitable and thus proves to be a robust indicator. When the
number of bits per machine word is greater, the contrast between the guesses is
relatively enhanced, but finding the right model could be more difficult in a
black box mode.



  SBox1     SBox2     SBox3     SBox4     SBox5     SBox6     SBox7     SBox8
  K ρmax    K ρmax    K ρmax    K ρmax    K ρmax    K ρmax    K ρmax    K ρmax
 28  77%   12  69%   43  73%    0  82%   15  52%   60  51%    5  51%   38  47%
 19  36%   27  29%   40  43%   29  43%   03  33%   10  34%   15  40%   05  29%
 42  35%   24  27%   36  35%   20  35%   58  30%   58  33%    6  29%   55  26%
 61  31%   58  27%   06  33%   60  32%   10  30%   18  31%   12  29%   39  25%



8 Conclusion 

Our experience on a large set of smart card chips over the last years has
convinced us of the validity of the Hamming distance model and the advantages of
the CPA method against DPA, in terms of efficiency, robustness and number of 
experiments. An important and reassuring conclusion is that all the countermea- 
sures designed against DPA offer the same defensive efficiency against the model 
based CPA attack. This is not that surprising since those countermeasures aim 
at undermining the common prerequisites that both approaches are based on: 
side-channel observability and intermediate variable predictability. 

The main drawback of CPA concerns the characterization of the leakage model
parameters. As it is more demanding than DPA, the method may seem more 
difficult to implement. However it may be objected that: 

— A statistical power analysis of any kind is never conducted blindly without 
any preliminary reverse engineering (process identification, bit tracing): this 
is the opportunity to quantify the leakage rate by CPA on known data. 

— DPA requires more sample curves anyway since all the unpredicted data bits 
penalize the signal to noise ratio (see [5]). 

— If DPA fails by lack of implementation knowledge (increasing the number of 
curves does not necessarily help), we have shown how to infer a part of this 
information without excessive efforts: for instance the reference state is to 
be found by exhaustive search only once in general. 

— There exist many situations where the implementation variants (like SBox
implementation in DES) are not so numerous because of operational
constraints.

— If part of the model cannot be inferred (SBox implementation in DES, hard- 
ware co-processor), partial correlation with the remainder may still provide 
exploitable indications. 

Finally, DPA remains relevant in case of very special architectures for which
the model may be completely out of reach, like in certain hard-wired
co-processors.

Abstract. Since their publication in 1998, power analysis attacks have
attracted significant attention within the cryptographic community. 

So far, they have been successfully applied to different kinds of (un- 
protected) implementations of symmetric and public-key encryption 
schemes. However, most published attacks apply to smart cards and only 
a few publications assess the vulnerability of hardware implementations. 

In this paper we investigate the vulnerability of Rijndael FPGA (Field 
Programmable Gate Array) implementations to power analysis attacks. 

The design used to carry out the experiments is an optimized architecture 
with high clock frequencies, presented at CHES 2003. First, we provide 
a clear discussion of the hypothesis used to mount the attack. Then, we 
propose theoretical predictions of the attacks that we confirmed exper- 
imentally, which are the first successful experiments against an FPGA 
implementation of Rijndael. In addition, we evaluate the effect of pipelin- 
ing and unrolling techniques in terms of resistance against power analysis. 

We also emphasize how the efficiency of the attack significantly depends 
on the knowledge of the design. 



1 Introduction 

Side-channel analysis is becoming a classical topic in cryptographic design, 
but although numerous papers investigate Differential Power Analysis (DPA) 
from a theoretical point of view, only a few articles focus on its practical
implementation. Moreover, most of the published research is related to smart 
cards and only a few papers assess the context of hardware and FPGA 
implementations. 

As soon as hardware design is concerned, the questions of effectiveness, 
clock frequency and area requirements are of primary importance. In this paper, 
we demonstrate that they also have a very substantial impact on the feasibility
of power analysis attacks. For this purpose, we investigated an optimized FPGA 
implementation of the Advanced Encryption Standard Rijndael [1,2], presented 
at CHES 2003. In addition to the practical evaluation of the attack, we present 
a number of original observations concerning: (i) the effect of pipelining and 
unrolling techniques in terms of resistance against power analysis attacks; (ii) 
the relationship between the knowledge of a hardware design and the efficiency 
of power analysis attacks; (iii) the effect of high clock frequencies on the
measurement setup. Moreover, we characterized some design components (e.g.
the registers) in terms of predictability and leakage. This results in tools that
could be used to analyze power analysis attacks in general. Finally, we compare 
our results with the only published attack against a hardware implementation 
of Rijndael that we are aware of [3] to validate our conclusions. 

This paper is structured as follows. Section 2 presents the hypothesis used to 
carry out the power analysis attack and Section 3 gives a short description of 
our Rijndael implementation. Section 4 describes how to perform theoretical 
predictions on the power consumption in a pipeline design and Section 5 explains 
how to use these predictions in order to mount a practical attack. Section 6 
presents theoretical predictions of the attack and their practical implementation 
is discussed in Sect. 7. Additional considerations about pipeline and unrolled 
designs are presented in Sect. 8. Section 9 re-discusses the hypothesis. Finally, 
conclusions are in Sect. 10. 



2 Hypothesis 

In Differential Power Analysis, an attacker uses a hypothetical model of the 
device under attack to predict its power consumption. These predictions are 
then compared to the real measured power consumption in order to recover 
secret information (e.g. secret key bits). The quality of the model has a 
strong impact on the effectiveness of the attack and it is therefore of primary 
importance. 

While little information is available on the design and implementation of 
FPGAs (much of the information is proprietary), we can make assumptions 
about how commercial FPGAs behave at the transistor level. The most popular 
technology used to build programmable logic is static RAM¹, where the storage
cells, the logic blocks and the connection blocks are made of CMOS gates. For 
these circuits, it is reasonable to assume that the main component of the power 
consumption is the dynamic power consumption. For a single CMOS gate, we 
can express it as follows [5]: 



$$P_D = C_L \, V_{DD}^2 \, P_{0 \to 1} \, f \qquad (1)$$

where $C_L$ is the gate load capacitance, $V_{DD}$ the supply voltage, $P_{0 \to 1}$
the probability of a 0 → 1 output transition and $f$ the clock frequency.

1 For all the experiments, we used a Xilinx Virtex XCV800 FPGA [4].

Equation (1) specifies that the power consumption of CMOS circuits is
data-dependent. However, for the attacker, the relevant question is to know if this
data-dependent behavior is observable. This was confirmed by the following test. 

Let three 4096-bit vectors be defined as follows. Initially, $a_0 = 00000\ldots001$
and $b_0, c_0 = 00000\ldots000$. Then:

$$a_{i+1} = SL(a_i), \quad b_{i+1} = b_i \oplus a_i, \quad c_{i+1} = c_i \oplus b_i,$$

where SL is the shift left operator and consecutive values $(x_i, x_{i+1})$ are
separated by a register. It is easy to see that:

— a is a bit-vector with a constant Hamming weight (H(a) = 1). The position
of the 1-bit inside the vector is incremented/decremented from 0 to 4095.

— b is a bit-vector for which the Hamming weight is incremented/decremented
from 0 to 4095.

— c is a bit-vector for which the number of bit switches between two consecutive
states is incremented/decremented from 0 to 4095.

A design that generates these three vectors was implemented in the FPGA. 
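
The three recurrences can be simulated in software to confirm the stated properties. The sketch below is an illustration only: it models the vectors as Python integers, uses a 16-bit width instead of 4096 for readability, and covers only the incrementing phase. The function names are hypothetical.

```python
def hamming_weight(x: int) -> int:
    return bin(x).count("1")


def generate_vectors(n_bits: int, steps: int):
    """Return the successive states (a_i, b_i, c_i) as Python integers."""
    mask = (1 << n_bits) - 1
    a, b, c = 1, 0, 0                      # a_0 = 0...01, b_0 = c_0 = 0
    states = [(a, b, c)]
    for _ in range(steps):
        # All three updates read the previous values, as clocked registers would.
        a, b, c = (a << 1) & mask, b ^ a, c ^ b
        states.append((a, b, c))
    return states


STATES = generate_vectors(n_bits=16, steps=16)  # 16 bits instead of 4096 (assumption)
for i in range(4):
    a, b, c = STATES[i]
    c_next = STATES[i + 1][2]
    # index, H(a_i), H(b_i), number of bit switches between c_i and c_{i+1}
    print(i, hamming_weight(a), hamming_weight(b), hamming_weight(c ^ c_next))
```

At step i, a keeps a Hamming weight of 1, b has Hamming weight i, and c switches exactly i bits, which is what lets the three effects be separated in the measured power consumption.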




Fig. 1. One single power trace Fig. 2. Preliminary test 

Figure 1 illustrates² a single power trace. Figure 2 illustrates the power
consumption of vectors a, b and c during about 20 000 clock cycles. From this 
experiment, we conclude that the power consumption clearly depends on the 
number of transitions in registers. 

Based on these considerations, we used the following hypothesis to mount 
power analysis attacks against FPGAs: “an estimation of a device power 
consumption at time t is given by the number of bit transitions inside the device 
registers at this time”. Predicting the transitions in registers is reasonable since
registers usually consume the largest part of the power in a design. 

2 Measurement setups for DPA have already been intensively described in the open 
literature. In Fig. 1, we observe the voltage variations over a small resistor inserted 
in the supply circuit of the FPGA. Every trace was averaged 10 times in order to 
remove the noise from our measurements. 


3 Hardware Description

A short description of the Rijndael algorithm is given in Appendix A. The
architecture used to investigate DPA against Rijndael was presented last year 
at CHES 2003 [6]. We briefly describe its significant details. 

SubBytes: The substitution box (S-box) is implemented as a 256 × 8
multiplexer and takes advantage of specific properties of the FPGA. Note 
that two pipeline stages are inserted for efficiency purposes, as represented in 
Appendix B. In SubBytes, this S-box is applied to the 16 bytes of the state in 
parallel. 

MixAdd: In [6], an efficient combination of MixColumns and the key addition
is proposed, based on an optimal use of the FPGA resources. The resulting
MixAdd transform allows MixColumns and AddRoundKey to be computed in
two clock cycles, the key addition being embedded with MixColumns in the
second cycle.

Complete architecture: The complete architecture is represented in
Fig. 3, where all the registers are 128-bit long³. It is a loop architecture
with pipeline, designed for optimizing the ratio Throughput (Mbits/s) / Area
(slices). It is important to remark that the multiplexer model for the S-box
implies that its first part uses four 128-bit registers. The resulting design
implements the round (and key round) function in 5 clock cycles and the
complete cipher in 52 clock cycles.

4 Predictions in a Pipeline Design 

The question we assess in this paper is to know whether pipelining has any 
influence on DPA resistance. We also investigate a practical design that is the 
result of efficiency optimizations. Loop architectures are a relevant choice for 
investigation because they satisfy the usual area and throughput requirements 
for block cipher applications. However, unrolled architectures will also be 
explored in a further section. 

Based on the hypothesis of Sect. 2, the first step in a power analysis
attack is to make theoretical predictions on the power consumption. This can
be done using a selection function D that we define as follows. Let $X_i$ and
$X_{i+1}$ be two consecutive values inside a target register. An estimation of the
register power consumption at the time of the transition is given by the function
$D = H(X_i \oplus X_{i+1})$. An attacker who has to predict the transitions inside the
registers of an implementation therefore needs to answer two basic questions:

1. Which register transitions can we predict? 

2. Which register transitions leak information? 

Answering these questions determines which registers will be targeted during 
the attack. As an attacker can use the plaintexts (resp. ciphertexts) and pre- 
dict transitions by partial encryption (resp. decryption), it is also important to 
evaluate both scenarios. 
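
As a minimal sketch of the selection function defined above, the following Python code counts the bit transitions between consecutive values of a register. The function names and the sample 8-bit values are hypothetical and serve only to illustrate D = H(X_i ⊕ X_{i+1}).

```python
def hamming_weight(x: int) -> int:
    return bin(x).count("1")


def selection_function(x_prev: int, x_next: int) -> int:
    """D = H(x_prev XOR x_next): number of bits that switch in the register."""
    return hamming_weight(x_prev ^ x_next)


def predict_transitions(register_values):
    """Apply D to every pair of consecutive values held by one register."""
    pairs = zip(register_values, register_values[1:])
    return [selection_function(prev, nxt) for prev, nxt in pairs]


VALUES = [0x00, 0xFF, 0x0F, 0x0F]   # hypothetical successive register contents
print(predict_transitions(VALUES))  # -> [8, 4, 0]
```

A register that keeps its value contributes 0 to the prediction, whatever that value is; only changes between consecutive clock cycles count under this hypothesis.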

3 Except the first part of MixAdd that is 176-bit long.


4.1 Definitions

i. The predictability of a register is related to the number of key bits one should 
know to predict its transitions. For block ciphers, this depends on the size of 
the S-boxes and the diffusion layer. In practice, it is assumed that it is possible 
to guess up to 16 key bits, and the diffusion layer usually prevents guessing 
of more than one block cipher round. In Rijndael, S-boxes are 8-bit wide and 
their outputs are thus predictable after the first (resp. final) key addition.
However, every MixColumns output bit depends on 32 key bits and is therefore 
computationally intensive to guess. 

ii. We denote a register as a full (resp. empty) register if its transitions leak
(resp. do not leak) secret information. For example, it is obvious that an input 
(resp. output) register does not leak any secret information as it only contains 
the plaintext (resp. ciphertext). A surprising consequence of the hypothesis 
introduced in Sect. 2 is that the registers following an initial (resp. final) key 
addition do not leak information either. To illustrate this statement, we use the 
following key addition: 

AddKey

{ result = input ⊕ key; }

Let us assume that the result is actually stored in an FPGA register R.
Let two consecutive inputs of the key addition be denoted as $\mathit{input}_1$ and
$\mathit{input}_2$. Using the previously defined selection function, the register power
consumption may be estimated by:

$$P_R \propto H(\mathit{result}_1 \oplus \mathit{result}_2) = H(\mathit{input}_1 \oplus \mathit{key} \oplus \mathit{input}_2 \oplus \mathit{key}) = H(\mathit{input}_1 \oplus \mathit{input}_2) \qquad (2)$$

Equation (2) clearly specifies that the register R is empty. In practice, registers of
our Rijndael implementation will actually remain empty as long as the state has
not passed through the non-linear S-box. Thereafter, the power consumption
depends on $H(\mathit{sbox}(\mathit{input}_1 \oplus \mathit{key}) \oplus \mathit{sbox}(\mathit{input}_2 \oplus \mathit{key}))$ and therefore on the key.

Note that this observation strongly depends on the hypothesis and selection
functions used to perform the attack, which we discuss further in
Sect. 9. Another surprising observation is that the register R may still leak
secret information if reset signals are used. This is due to the constant state
that reset signals introduce. Then, we have:

$$P_R \propto H(\text{“all zeroes”} \oplus \mathit{result}_1) = H(\text{“all zeroes”} \oplus \mathit{input}_1 \oplus \mathit{key}) = H(\mathit{input}_1 \oplus \mathit{key}) \qquad (3)$$

which makes the power consumption dependent on the key again. As a
consequence, a secure hardware implementation should not apply reset signals to its
inner registers, in order to eliminate this additional information leakage. Note that a
similar observation has been used to attack smart card implementations, where
the constant state actually corresponds to a constant instruction address.
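
The difference between empty and full registers in Equations (2) and (3) can be checked numerically. The sketch below is a toy illustration under stated assumptions: it works on 4-bit values, uses an arbitrary 4-bit permutation as a stand-in for the Rijndael S-box, and picks two hypothetical inputs. For each case it collects the predicted transition counts over all 16 possible keys.

```python
# Arbitrary 4-bit permutation standing in for the S-box (assumption, NOT Rijndael's).
SBOX4 = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
         0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]


def hamming_weight(x: int) -> int:
    return bin(x).count("1")


INPUT_1, INPUT_2 = 0x3, 0x5   # hypothetical consecutive inputs
KEYS = range(16)

# Equation (2): two consecutive key additions; the key cancels out.
after_addkey = {hamming_weight((INPUT_1 ^ k) ^ (INPUT_2 ^ k)) for k in KEYS}

# Equation (3): the register is reset to all zeroes before storing input_1 XOR key.
after_reset = {hamming_weight(0 ^ INPUT_1 ^ k) for k in KEYS}

# After the non-linear S-box, the key no longer cancels.
after_sbox = {hamming_weight(SBOX4[INPUT_1 ^ k] ^ SBOX4[INPUT_2 ^ k]) for k in KEYS}

print("after key addition:", after_addkey)  # one value only: the register is empty
print("after reset:       ", after_reset)   # several values: the register leaks
print("after S-box:       ", after_sbox)    # several values: the register is full
```

A set with a single element means the prediction is the same for every key, so the transition carries no key information under this model.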

Fig. 3. Encryption predictions.



4.2 Predictions in Rijndael 

Figure 3 illustrates predictable and full registers when our AES design is filled
with 5 different texts, denoted 1, 2, ..., 5, during the first eight clock cycles of
an encryption. As an example, during the first cycle, register R1 contains the
plaintext 1 while all the other registers are undefined. During the second cycle,
R1 contains the plaintext 2, R2 contains the plaintext 1 and the other registers
are undefined. Note that in the eighth cycle, the multiplexer starts to loop
and register R3 therefore contains data corresponding to plaintext 1 again.

Similarly, Figure 4 illustrates predictable and full registers when our 
AES design is filled with 5 different texts, denoted 1, 2, ..., 5, during the last six
clock cycles of an encryption. As an example, the register R12 contains the first 
ciphertext in the second cycle, ciphertext 2 in the third cycle and ciphertext 3 
in the fourth cycle. 

In the next section, we explain how theoretical predictions of the power 
consumption can be used to attack an FPGA implementation of Rijndael. 



5 Description of a Correlation Attack 

A correlation attack [3,7] against an FPGA implementation of Rijndael is divided
into three steps. Let N be the number of plaintext/ciphertext pairs for which the
power consumption measurements are accessible. Let K be the secret encryption
key. When simulating the attacks, we assume that K is known to the attacker.
In case of practical attacks, it is of course unknown.

Fig. 4. Decryption predictions.



Prediction phase: For each of the N encrypted plaintexts, the attacker
first selects the target registers and clock cycle for the previously defined
selection function D. In Fig. 3, we see that between cycles 7 and 8,
registers R4, R5, R6, R7, R8, R11 and R12 are full and have predictable and
defined values. Similarly, in Fig. 4, we observe that between cycles 1 and
2, registers R3, R4, R5, R6, R7 and R10 are full and have predictable and
defined values. Depending on the knowledge of the design, these registers can
therefore be targeted. Due to the size of the Rijndael S-box, the predictions
are performed on 8 bits and may be repeated for every 8-bit part of a register Ri.
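
One way to tabulate such 8-bit predictions for every key-byte guess is sketched below. Several points are assumptions made for illustration: a single 8-bit register part is targeted, it is taken to hold the S-box output of consecutive plaintexts on consecutive cycles, and a seeded random byte permutation stands in for the Rijndael S-box. The names `TOY_SBOX`, `predict_row` and `prediction_matrix` are hypothetical.

```python
import random

_rng = random.Random(2004)
TOY_SBOX = list(range(256))
_rng.shuffle(TOY_SBOX)  # stand-in byte permutation, NOT the Rijndael S-box


def hamming_weight(x: int) -> int:
    return bin(x).count("1")


def predict_row(byte_prev: int, byte_next: int):
    """Predicted bit switches in one 8-bit register part, for all 256 key-byte guesses."""
    return [hamming_weight(TOY_SBOX[byte_prev ^ k] ^ TOY_SBOX[byte_next ^ k])
            for k in range(256)]


def prediction_matrix(plaintext_bytes):
    """One row per transition between consecutive plaintext bytes, one column per guess."""
    pairs = zip(plaintext_bytes, plaintext_bytes[1:])
    return [predict_row(prev, nxt) for prev, nxt in pairs]


PLAINTEXT_BYTES = [_rng.randrange(256) for _ in range(11)]  # hypothetical inputs
MATRIX = prediction_matrix(PLAINTEXT_BYTES)
print(len(MATRIX), "rows x", len(MATRIX[0]), "key guesses")
```

Each entry lies between 0 and 8 because a single 8-bit part is targeted; targeting several parts would amount to summing the corresponding predictions.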

Let t be the number of 8-bit registers targeted by the attacker. Then, he
predicts the value of D (i.e. the number of bit switches inside the target
registers in the targeted clock cycle) for the $2^8$ possible key guesses and N
plaintexts. The result of the prediction phase is an $N \times 2^8$ selected prediction
matrix, containing integers between 0 and $8 \times t$. For simulation purposes, it
is also interesting to produce the global prediction matrix that contains



